CSC 558 - Data Mining and Predictive Analytics II, Fall 2024
    Course page

Final project presentations:
Spring 2018 we had 9 students in the course, and talks ran until 10 PM.
Spring 2020 we had 18 students, so we had to spread talks over the last day of class + the final exam period.
Fall 2021 we had 22 students. Spring 2023 we had 12 students.
This semester we have 30 students. We usually went 15 minutes per presentation in the past.
We will go 15 students on 2 nights for 10 minutes each = 2.5 hours + a break, STARTING AT 6.
We need to start on time. I will warn via Chat at 7 & 8 minutes. Please leave time for question(s).
Save detail for the paper. The talk is an advertisement to read the paper.
Paste any necessary URLs into the reference section of the paper.

Each presentation is scheduled for 10 minutes that include time for questions, answers, and changing presenters.
Pace your talk for 8 minutes at the most. I will signal via Chat at 3 minutes and 2 minutes before the end time,
and will signal via mic at 0.5 minute if needed. Everyone will present via screen sharing on the course Zoom page.
Please let me know ASAP if you need a change of date or time.

Below are approved topics as they arrive. Students get to pick a presentation time once I have a topic like those below.

Jovonni DeJesus   Crimes - 2001 to Present from the Chicago Data Portal
Arek Gebka            Predictive Analysis for Grand Theft Auto 6 Metrics
Joel Tulanowski     Counter Strike 2 Gaming Statistics Analysis

December 5, 6-8:50 PM
6:00                Reiley Walther          International Rugby Match Outcome Predictor
6:10                Julia Craft                 Analyze how canine breed characteristics correlate with health outcomes.
6:20                Christopher Cohen   Correlating air quality (pollution levels) to Lung Cancer Cases in 2022.
6:30                Nathaniel Moyer       Correlation of other environmental factors to light pollution.
6:40                Nina Schnyder          Correlating weather station data (dis)agreements across two locations.
6:50                Mark Licina               LondonAir Pollution and air quality data from 2020 to 2024 (maybe KNIME).
7:00                Patrick Perrin            Analyzing Automated Driving Crash Data.
7:10                Jarred Reccek          Predict Online Gaming Behavior Dataset
7:20                break                      Usually a few talks run a minute over, so this may be as low as 10 minutes.
7:35                Felicity Brown         Correlating literacy rates in U.S. by state to unemployment rates, crime, and school funding.
7:45                Dominic Derafelo    Economic indicator effects of national debt for North American Countries -> debt growth / decline.
7:55                Ricky Oundo            Ranking of Korean TV shows based on year, reviews, engaging activities etc.
8:05                Joe Jardine              Golf tournament "missed cuts" as a function of other golfing attributes.
8:15
8:25
8:35

December 12, 6-8:50 PM
6:00               Travis Melcher          Data Analysis of Hersheypark Annual Ride Queues.
6:10               Vincent Allen             Factors in U.S. Dept Ed college scorecard ->  student outcomes.
6:20               Thaddeus Hall           Amazon US Customer Review Dataset.
6:30               Delijah Joseph          Correlating laptop specifications and prices.
6:40               Noelle Cheh              Similarities or how to apply Hertzbleed attacks to how fMRI machines work to study the brain.
6:50               Gavin Wolff               National Hockey League goal data as a function of other game attributes.
7:00               Zoryana Duda           Correlating printing attributes -> ink usage for the HP Designjet Z6 24".
7:10               Kenneth Au               Predicting Election Outcomes Based on Demographics
7:20              break                       Usually a few talks run a minute over, so this may be as low as 10 minutes.
7:35     
7:45              Damilare Ogunsesan Analyze Salary as a function of science / tech employment attributes.
7:55              Zachary Andruchowitz Prediction of Gold Glove award winners in Major League Baseball.
8:05              Josh Cacayan         How music impacts mental health as a function of listening attributes.
8:15              Andy Goldstein       How prices and economic conditions affect sales of gaming consoles.
8:25
8:35              Jacob Wolf     Correlation between spin rate pitches in baseball and the number of hits by the batters.
                               (^^^Pre-recorded on December 12 due to a class timing conflict.)
 
Assignment 4 (stage 1 of 2) is due by the end of November 26 via D2L.

Final Mini-Research Project (Assignment 5) is due by end of December 11,
with an in-class or on-line RTVC presentation of 10 minutes as described above.
You must have your presentation slides, paper, and other files to me by end of December 11.
Spring 2023 first-place paper and slides.
    Correlating limestone-rich versus limestone-poor stream beds in PA, including multiple
    sensor-recorded attributes, to dissolved oxygen and/or pH concentration in the water.
Spring 2023 second-place paper and slides.
    Analyzing Education and Poverty’s Impact on Violent Crime Rate in Pennsylvania
Ms. Pagan's data collection and preparation work was extensive, accurate,
and successful. She identified poverty as a prime indicator of violent
crime rates, and suggested inclusion of public funding and down-scaled
analysis for soliciting donations & volunteers as areas for potential
future work.
Spring 2023 third-place paper and slides.
    Superbowl Advertisement Data Analysis
Data preparation was more straightforward for this project than the above two,
but it required custom, justified attribute and instance removal and
addition. The analytical steps using tools and techniques outside the
scope of prior CSC458 and CSC558 projects serve as tie breakers in earning
this work third place, namely, use of Excel pivot tables (very intelligible
for the presentation), t-Test and Anova.
Also, the project does have good potential for commercial use.
The paper and slides from fall 2021 that won the Liquid Interactive Award.
    Data Collection, Cleaning, and Analysis of Gamers at Kutztown and their social habits.
    Especially important was Connor's use of techniques we had not used & inspiring another student to use them.
Here are the slides from the 2020 project that won the Liquid Award.
Here are the paper and slides from the 2018 project that won the Liquid Award.

Please use the D2L Assignment page for all materials by end of ???.
    You can turn in a text file containing the URL of your slides if they are on-line.
Please ZIP multiple big files together into a single zip file.
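
For students who script their workflow, here is a minimal sketch of bundling several files into one zip archive with Python's standard zipfile module. The function name zip_files and the file names in the usage comment are hypothetical; any zip tool that produces a single archive works equally well.

```python
import zipfile
from pathlib import Path


def zip_files(paths, archive_name):
    """Bundle the given files into one compressed zip archive."""
    with zipfile.ZipFile(archive_name, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in map(Path, paths):
            # arcname stores only the file name, not the full local path.
            zf.write(p, arcname=p.name)
    return archive_name


# Hypothetical usage:
# zip_files(["paper.pdf", "slides.pdf", "data.arff"], "project.zip")
```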

Use D2L to turn in materials.

*************** Some Q&A about Assignment 4:

I received a question about the format/depth of assignment 4. Here is my answer, with editing, and some additional perspective.

Please write complete sentences and paragraphs, and supply some illustrations if appropriate, somewhat similar to writing an analysis/design doc for a software engineering course. This should be a doc that you might send on to colleagues who were going to implement your plan. In Bell Labs we'd call that a Technical Memorandum. Just think back to the undergrad Software Engineering I course. This should be a document you'd be ready to publish to colleagues. Use the section 4 outline below.

Use my writing guidelines.

Write this up as though you were looking to someone to fund a contracted data analysis project. This is a professional-quality technical document, not just an outline. My outline is just that, a structural starting point. Write a meaty technical document that fleshes out everything in 4.2. This doc counts as much as the final deliverable. While I have not set a page requirement, the number of pages should be commensurate with the complexity and importance of the task.

*****************************************************************************

I am loosening the "there must have been no prior analysis" constraint somewhat. If there is previous analysis available, then the student can do additional analysis using different (new) approaches and use results to confirm or refute parts of the original analysis. Also, this approach of extending existing analysis will not disqualify the student from the Liquid prize, although starting analysis from scratch does earn some points for the prize. See more on the Liquid prize below in this section.

Assignment 4 due ???: Identify a dataset and goal for the project, obtain the data, check and clean it as necessary, and document your goals, your steps, and the relevance of the project to commercial or research application. Dr. Parson must approve the dataset. Identify in your documentation whether this is a fresh dataset with no prior analysis, or whether you are extending existing analysis, confirming or refuting parts of that analysis by using data modeling techniques not used in the original analysis. Data cleaning may just be a matter of converting a comma-separated-value (CSV) file into ARFF format and using AddExpression or similar filters to create derived attributes. You may use a tool other than Weka. Use D2L Assessment -> Assignment 4 for turning in assignment 4 by the end of November 26. Deliverables:
    4.1 An ARFF file or other tool-specific data file with cleaned data and ALL anticipated derived attributes.
            "Derived attributes" may be extracted using a script (such as Python), Weka AddExpression, or similar.
    4.2 A PDF file that documents the following items. Use these section numbers.
            4.2.a What is the source of your data? Include any links or references to the data source.
                You can include the raw data if it is not too big for D2L. A URL in the write-up is good enough.
            4.2.b What is your intended goal in analyzing this data set? Are you extending previous analysis or starting new analysis?
            4.2.c What steps have you taken so far to get the dataset to its state for 4.1 above? What problems did you encounter?
            4.2.d How could the results of the analysis be used in a commercial or research setting?
            4.2.e What machine learning / modeling techniques do you anticipate using? Nominal classification, numeric estimation, other?
                        How do(es) the planned modeling technique(s) relate to 4.2.d?
                        Identify the modeling tool or tools you plan to use.
            4.2.f Document any other aspect of the project that you feel is important to communicate.
            4.2.g Use clear, descriptive writing. Use my writing guidelines 1-7. Get someone to proofread your doc (not me).
    Counting 4.1 and 4.2.a through 4.2.g, there are 8 rubrics, each worth 12.5% of the assignment 4 grade.
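
The CSV-to-ARFF conversion and scripted derived attributes mentioned for 4.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a required approach: the function csv_to_arff, the column names (site, temp_c, do_mgl), and the derived attribute temp_f are all hypothetical, the CSV is assumed to have a header row and at least one data row, and attribute names or values containing spaces or commas would need ARFF quoting that this sketch does not do.

```python
import csv
import io


def is_number(s):
    """Return True if the string parses as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation, derived=None):
    """Convert CSV text to ARFF text.

    derived maps a new attribute name to a function of one row (a dict),
    a scripted stand-in for a filter such as Weka's AddExpression.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())
    for name, fn in (derived or {}).items():
        for row in rows:
            row[name] = str(fn(row))
        names.append(name)
    lines = ["@relation " + relation, ""]
    for name in names:
        values = [r[name] for r in rows if r[name] not in ("", "?")]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in sorted order.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@data"]
    for r in rows:
        # Empty CSV cells become ARFF missing values ("?").
        lines.append(",".join(r[n] if r[n] != "" else "?" for n in names))
    return "\n".join(lines) + "\n"


# Hypothetical example data: two water-quality readings.
demo_csv = "site,temp_c,do_mgl\nA,10.0,9.5\nB,20.0,7.1\n"
demo_arff = csv_to_arff(
    demo_csv, "demo",
    derived={"temp_f": lambda r: float(r["temp_c"]) * 9 / 5 + 32},
)
```

Printing demo_arff shows a header with one @attribute line per column, including the derived temp_f, followed by the @data rows.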

Assignment 5 as described below with the current changes in this section, due by ???.
    Deliverables:
    5.1 COMPRESSED ARFF file(s) or other tool-specific data file(s) as modified during this analysis phase.
    5.2 A PDF file that documents the following items. Use your original document and add sections to it using these section numbers.
            5.2.a What additional data did you collect during analysis, if any? Include any links or references to the data source.
                You can include compressed data files to D2L if it accepts them. A URL is good enough.
            5.2.b Did you achieve your intended goal in analyzing this data set? Explain how analysis shows goal achievement or refutation.
                    Include classification results and explain how they achieve, refute, or otherwise relate to your goals.
            5.2.c What machine learning/modeling steps have you taken?
                    Show classification / regression results. Show filtering used. Explain it using the detail that I use in assignments 1-3 solutions.
                    What problems did you encounter?
            5.2.d Use SMO, or SMOreg, or MultiLayerPerceptron, or clustering, OR at least one other technique not used in assignments 1-3.
                    Give results and explain how this step relates to earlier steps.
            5.2.e Revise your explanation of how the results of the analysis could be used in a commercial or research setting.
            5.2.f Document any other aspect of the project that you feel is important to communicate.
            5.2.g Use clear, descriptive writing. Use my writing guidelines 1-7. Get someone to proofread your doc (not me).
    5.3 A clear, 10-minute-bound presentation to the class on December 5 or 12 as explained above.
Project 5 grading rubrics:
    The 5.3 presentation counts for 20% of this project. Make it clear. Be ready to answer questions.
        If in the classroom, repeat the questions for RTVC students before answering them.
    The remaining 80% distribute as 10 points each for 5.1 and 5.2.a through 5.2.g.

Liquid Interactive Award of $1000 for the best overall result for Assignment 5.
    The winner agrees to participate in a public relations event to promote KU's data analytics program, and to an externship at Liquid.
    I will evaluate Assignment 5 rubrics and the 4.x sections of the document, noting that the PDF of 5.2 includes the write-up of 4.2.
    We will use the grading rubrics of section 5 above as our baseline. Projects 4 & 5 must meet deadlines to qualify for the award.
    Tie breaking activities include (with equal weight):
        I. Performing original analysis on a dataset that you formulated from raw data sources using your own ideas.
            If no tied-for-grade project does that, then scripting your own data cleaning in Python or R is a tie breaker.
        II. Discovering significantly more than your initial goal, where significant is roughly twice as much as the goal.
        III. Overall ambition, i.e., not just going with the simplest project that will get the grade.
        IV. Conception and creation of a project that could be carried forward into a graduate research project or a commercial application.
    You do not need to perform tie breaking activities to get 100% on assignment 5. These are just for breaking assignment 5 grade ties.

Dr. Parson                     Bash Shell Scripting for Data Science for first hour November 29.

                                      I plan to make data-project-specific components plug into a Python framework.
                                          https://scikit-learn.org/stable/
                                          1. Load training data and test data.
                                          2. Preprocess -> filter data.
                                          3. Classify or regress or cluster using multiple alternative machine learning libraries.
                                          4. Extract & report results in common format, optionally generating graphics.
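
The four steps above can be sketched as a tiny plugin-style pipeline. This is only an illustration, not the planned framework: it uses the Python standard library rather than scikit-learn, and every name in it (run_pipeline, MajorityClass, OneNearestNeighbor, demo_load, demo_preprocess) is hypothetical. The learners follow the fit/predict naming that scikit-learn estimators use, on the assumption that real library classifiers could be swapped in the same way.

```python
from collections import Counter


class MajorityClass:
    """Baseline learner: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class OneNearestNeighbor:
    """Predicts the label of the closest training instance (squared Euclidean distance)."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            self.y[min(range(len(self.X)), key=lambda i: dist2(self.X[i], x))]
            for x in X
        ]


def run_pipeline(load, preprocess, learners):
    # 1. Load training data and test data (project-specific plugin).
    (train_X, train_y), (test_X, test_y) = load()
    # 2. Preprocess -> filter data (project-specific plugin).
    train_X, test_X = preprocess(train_X), preprocess(test_X)
    results = []
    # 3. Try each alternative learner on the same data.
    for learner in learners:
        predicted = learner.fit(train_X, train_y).predict(test_X)
        correct = sum(p == t for p, t in zip(predicted, test_y))
        # 4. Report every learner's result in one common format.
        results.append({"learner": type(learner).__name__,
                        "accuracy": correct / len(test_y)})
    return results


def demo_load():
    """Hypothetical data: the second column is noise that the filter removes."""
    train = ([[1.0, 9.0], [1.2, 3.0], [5.0, 4.0], [5.5, 8.0], [6.0, 1.0]],
             ["low", "low", "high", "high", "high"])
    test = ([[0.9, 2.0], [5.2, 7.0]], ["low", "high"])
    return train, test


def demo_preprocess(X):
    """Keep only the first attribute of each instance."""
    return [row[:1] for row in X]


demo_results = run_pipeline(demo_load, demo_preprocess,
                            [MajorityClass(), OneNearestNeighbor()])
```

Because the project-specific parts are passed in as functions, a different dataset only needs new load and preprocess plugins; the loop over learners and the result format stay the same.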

*****************************************************************************


Project requirements:

1. Pick a data source and a dataset that interests you. You can use an existing ARFF file or CSV file
    for which there are NO POSTED SOLUTIONS, or you can use a data source that requires you
    to clean and format the data for your tool of choice. You can use Weka, KNIME, RapidMiner,
    or another open source tool that I can run. I must approve your tool choice if it is not Weka, and
    you must include a link to your data source and any related documentation. Using a data source
    for which there are posted solutions, or absence of a link to your data source, earns 0%.
    Get my approval on the data source & dataset as early as possible.

Here are some data sources to consider. You can also find your own. Some of these links are unverified as of 2/2020.

50 years of Hawk Mountain data concerning raptor migration.
        In 2019 csc458 we analyzed broad-winged hawk counts for 2017 & 2018. Now we have 50 years of data.
       
Dr. Goodrich of Hawk Mountain is interested in correlating climate change with raptor observations.
        This is an open-ended project that could become a CSC570 Independent Study.

Kaggle    Data competition site with lots of open problems.
        https://www.kaggle.com/

http://waterdata.usgs.gov/nwis is a U.S. Geological Survey source of data that we used in 2018 csc458 for streamflow data.
        I started looking at doing time series analysis for Dissolved Oxygen -> Chlorophyll -> pH level relationships
        in January, but didn't make much headway because the pH levels were near constant. However, I did not
        use site number as a major sort key, with time as a minor sort key, so the sites got mingled inappropriately.
        Some adventurous soul could resume that project.
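
The sort order described above, site number as the major key and time as the minor key, can be sketched as follows. This is a minimal illustration and not the original analysis code: the record fields (site, time, ph) and the function name series_by_site are hypothetical, and times are assumed to be ISO-8601 strings so that string order matches chronological order.

```python
from itertools import groupby
from operator import itemgetter


def series_by_site(records):
    """Sort by site (major key), then time (minor key), and
    return one time-ordered list of records per site."""
    ordered = sorted(records, key=itemgetter("site", "time"))
    # groupby needs its input already sorted by the grouping key.
    return {site: list(group)
            for site, group in groupby(ordered, key=itemgetter("site"))}


# Hypothetical readings from two sites, deliberately interleaved.
demo_records = [
    {"site": "02", "time": "2018-01-01T01:00", "ph": 7.1},
    {"site": "01", "time": "2018-01-01T02:00", "ph": 7.4},
    {"site": "02", "time": "2018-01-01T00:00", "ph": 7.0},
    {"site": "01", "time": "2018-01-01T00:00", "ph": 7.3},
]
demo_series = series_by_site(demo_records)
```

Each value in demo_series is then a single site's readings in time order, so a time series analysis never steps from one site's reading to another's.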

Chincoteague Bay Field Station
    NPS estuarine water quality data from our long-term monitoring sites throughout Sinepuxent and Chincoteague Bays should be available for direct download through the EPA-STORET database.
    I (Parson) am interested in doing some field work there in the future.
    Kutztown's page on working at this center.
    The web address for the NPS Research Permitting System is: https://irma.nps.gov/rprs.
    http://www.cbfieldstation.org/ is a related web site (not data).

Thanks to Geoffrey Pitman for forwarding the following links!

1. World Bank Open Data Datasets covering population demographics and a huge number of economic and development indicators from across the world.
2. IMF Data The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices and investments.
3. The US National Center for Education Statistics Data on educational institutions and education demographics from the US and around the world.
4. The UK Data Centre The UK’s largest collection of social, economic and population data.
5. FiveThirtyEight A large number of polls providing data on public opinion of political and sporting issues.
6. FBI Uniform Crime Reporting The FBI is responsible for compiling and publishing national crime statistics, with free data available at national, state and county level.
7. Bureau of Justice Here you can find data on law enforcement agencies, jails, parole and probation agencies and courts.
8. Qlik Data Market Offers a free package with access to datasets covering world population, currencies, development indicators and weather data.
9. NASA Exoplanet Archive Public datasets covering planets and stars gathered by NASA’s space exploration missions.
10. UN Comtrade Database Statistics compiled and published by the United Nations on international trade. Includes Comtrade Lab which is a showcase of how cutting edge analytics and tools are used to extract value from the data.
11. Financial Times Market Data Up to date information on financial markets from around the world, including stock price indexes, commodities and foreign exchange.
12. Google Trends Examine and analyze data on internet search activity and trending news stories around the world.
13. Twitter The advantage Twitter has over the others is that most conversations are public. This means that huge amounts of data are available through their API on who is talking about what, where, when and why.
14. Google Scholar Entire texts of academic papers, journals, books and legal case law.
15. Instagram As with Twitter, Instagram posts and conversations are public by default. Their APIs allow likes, mentions and business details to be analyzed.
16. OpenCorporates The world’s largest open database of companies.
17. Glassdoor API Information about job vacancies, candidates, salaries and employee satisfaction is available through their developer API.
18. IMDB Datasets Datasets in a number of formats drawn from the web’s largest resource on movies, television and people working in those industries.
19. OpenLibrary Data Dumps Datasets on books including catalogues from libraries around the world
20. Labelled Faces in the Wild 13,000 collated and labelled images of human faces, for use in developing applications involving facial recognition.
21. Microsoft Marco Microsoft’s open machine learning datasets for training systems in reading comprehension and question answering.
22. Machine Learning Dataset Repository Collection of open datasets contributed by data scientists involved in machine learning projects.
23. EBay Market Data Insights Data on millions of online sales and auctions from Ebay
24. Natural History Museum Data Portal Information on nearly 4 million historical specimens in the London museum’s collection, as well as scientific sound recordings of the natural world.
25. CERN Open Data More than one petabyte of data from particle physics experiments carried out by CERN.
26. One Million Audio Cover Images Dataset hosted at archive.org covering music released around the world, for use in image processing research
27. Complete Public Reddit Comments Corpus Over one billion public comments posted to Reddit between 2007 and 2015, for training language algorithms
28. Microsoft Azure Data Markets Free Datasets Freely available datasets covering everything from agriculture to weather
29. Irish Electric Vehicle Charge Point Status Collates data from the body which oversees the network of EV charge points across the Republic of Ireland and Northern Ireland.  
30. LondonAir Pollution and air quality data from across London
       
2. Analyze that dataset using techniques we have learned this semester.
    You must find at least one pattern in the data that does not use tagged attributes for non-target attributes.
    You can use tagged attributes as non-target attributes during the initial phase as we have for assignment 3.
    You must eliminate tagged attributes except for the target attribute for the final, data-based analysis.
    The target attribute may be a tagged attribute (likely). This is up to you.

3. Document your analysis stages and results similar in format to my solution handouts. Document your work.
    This must be a PDF paper. You do not need to use the Q1 .. Qn format, although you can.
    I want to see what you tried.
    I want to see what approaches worked, why they worked, and what they found.
         This is the main point. Find at least one non-obvious pattern / correlation, and explain why it is significant.
    I want to see what approaches did not work, and why. You get credit for trying things.
    Include a summary of how your findings would be relevant in an industrial or research application of data analytics.

4. Make a ??-minute presentation in the ?? class -- ?? minutes include Q&A and setup/tear down time, summarizing the above.
    You can do this in class or via Zoom
    I will flip slides if you do it via Zoom.

5. You must have the above PDF doc (3), your slides (4), and your COMPRESSED dataset to me via D2L by end of December 5.
    You can turn in a text file containing the URL of your slides if they are on-line.
    10% penalty for anything missing or late.
    Send me everything via acad.