CSC 558, Fall 2021, Final Project Assignment

*************** Some Q&A about Assignment 4:

I received a question about the format and depth of assignment 4. Here is my answer, lightly edited, with some additional perspective.

Please write complete sentences and paragraphs, and supply some illustrations if appropriate, somewhat similar to writing an analysis/design doc for a software engineering course. This should be a doc that you might send on to colleagues who were going to implement your plan. In Bell Labs we'd call that a Technical Memorandum. Just think back to the undergrad Software Engineering I course. This should be a document you'd be ready to publish to colleagues. Use the section 4 outline below.

Use my writing guidelines.

Write this up as though you were looking to someone to fund a contracted data analysis project. This is a professional-quality technical document, not just an outline. My outline is just that, a structural starting point. Write a meaty technical document that fleshes out everything in 4.2. This doc counts as much as the final deliverable. While I have not set a page requirement, the number of pages should be commensurate with the complexity and importance of the task.

*****************************************************************************

I am loosening the "there must have been no prior analysis" constraint somewhat. If there is previous analysis available, then the student can do additional analysis using different (new) approaches and use results to confirm or refute parts of the original analysis. Also, this approach of extending existing analysis will not disqualify the student from the Liquid prize, although starting analysis from scratch does earn some points for the prize. See more on the Liquid prize below in this section.

Assignment 4 due November 26: Identify a dataset and goal for the project, obtain the data, check and clean it as necessary, and document your goals, your steps, and the relevance of the project to commercial or research applications. Dr. Parson must approve the dataset. Identify in your documentation whether this is a fresh dataset with no prior analysis, or whether you are extending existing analysis, confirming or refuting parts of that analysis by using data modeling techniques not used in the original analysis. Data cleaning may just be a matter of converting a comma-separated-value (CSV) file into ARFF format and using AddExpression or similar filters to create derived attributes. You may use a tool other than Weka. Use D2L Assessment -> Assignment 4 for turning in assignment 4 by the end of November 26. Deliverables:
    4.1 An ARFF file or other tool-specific data file with cleaned data and ALL anticipated derived attributes.
            "Derived attributes" may be extracted using a script (such as Python), Weka AddExpression, or similar.
    4.2 A PDF file that documents the following items. Use these section numbers.
            4.2.a What is the source of your data? Include any links or references to the data source.
                You can include the raw data if it is not too big for D2L. A URL in the write-up is good enough.
            4.2.b What is your intended goal in analyzing this data set? Are you extending previous analysis or starting new analysis?
            4.2.c What steps have you taken so far to get the dataset to its state for 4.1 above? What problems did you encounter?
            4.2.d How could the results of the analysis be used in a commercial or research setting?
            4.2.e What machine learning / modeling techniques do you anticipate using? Nominal classification, numeric estimation, other?
                        How do(es) the planned modeling technique(s) relate to 4.2.d?
                        Identify the modeling tool or tools you plan to use.
            4.2.f Document any other aspect of the project that you feel is important to communicate.
            4.2.g Use clear, descriptive writing. Use my writing guidelines 1-7. Get someone to proofread your doc (not me).
    Counting 4.1 and 4.2.a through 4.2.g, there are 8 rubrics, each worth 12.5% of the assignment 4 grade.
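
As an illustration of the scripted route to deliverable 4.1, here is a minimal Python sketch that converts CSV text to ARFF and appends derived numeric attributes. It is a sketch under stated assumptions, not a replacement for Weka's own converters: it assumes at least one data row, no missing values, and nominal values without spaces or commas. The function name csv_to_arff, the column names, and the temp_f attribute are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation, derived=None):
    """Convert CSV text to ARFF text, appending derived numeric attributes.

    `derived` maps a new attribute name to a function that takes one row
    (a dict of strings) and returns a number.
    Assumes: at least one row, no missing values, simple unquoted values.
    """
    derived = derived or {}
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())

    def is_numeric(name):
        # A column is numeric only if every value parses as a float.
        try:
            for row in rows:
                float(row[name])
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for name in names:
        if is_numeric(name):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values seen in the data.
            values = sorted({row[name] for row in rows})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    for name in derived:
        lines.append("@attribute %s numeric" % name)

    lines += ["", "@data"]
    for row in rows:
        cells = [row[name] for name in names]
        cells += [str(fn(row)) for fn in derived.values()]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical sample data with one derived attribute (Celsius to Fahrenheit).
SAMPLE = "temp_c,wind,label\n10,3.5,low\n25,1.0,high\n"
ARFF = csv_to_arff(
    SAMPLE, "demo", {"temp_f": lambda r: float(r["temp_c"]) * 9 / 5 + 32}
)
```

A real dataset will likely need extra handling for missing values (ARFF uses ?) and for quoting nominal values that contain spaces.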

Assignment 5 as described below with the current changes in this section, due by end of December 5.
    Deliverables:
    5.1 COMPRESSED ARFF file(s) or other tool-specific data file(s) as modified during this analysis phase.
    5.2 A PDF file that documents the following items. Use your original document and add sections to it using these section numbers.
            5.2.a What additional data did you collect during analysis, if any? Include any links or references to the data source.
                You can include compressed data files to D2L if it accepts them. A URL is good enough.
            5.2.b Did you achieve your intended goal in analyzing this data set? Explain how analysis shows goal achievement or refutation.
                    Include classification results and explain how they achieve, refute, or otherwise relate to your goals.
            5.2.c What machine learning/modeling steps have you taken?
                    Show classification / regression results. Show filtering used. Explain it using the detail that I use in assignments 1-3 solutions.
                    What problems did you encounter?
            5.2.d Use SMO, or SMOreg, or MultiLayerPerceptron, or clustering, OR at least one other technique not used in assignments 1-3.
                    Give results and explain how this step relates to earlier steps.
            5.2.e Revise your explanation of how the results of the analysis could be used in a commercial or research setting.
            5.2.f Document any other aspect of the project that you feel is important to communicate.
            5.2.g Use clear, descriptive writing. Use my writing guidelines 1-7. Get someone to proofread your doc (not me).
    5.3 A clear, 15-minute-bound presentation to the class on December 6 or 13 as explained below.
Project 5 grading rubrics:
    The 5.3 presentation counts for 20% of this project. Make it clear. Be ready to answer questions.
        If in the classroom, repeat the questions for RTVC students before answering them.
    The remaining 80% distribute as 10 points each for 5.1 and 5.2.a through 5.2.g.
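
For the COMPRESSED data files in 5.1, any standard compression tool will do. As one possibility, here is a minimal Python sketch that gzips a file in place alongside the original. The helper name gzip_file is hypothetical, and gzip is an assumption on my part since the assignment does not name a compression format.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path):
    """Write a gzip-compressed copy of `path` next to it and return its Path.

    Example: data.arff -> data.arff.gz (the original file is left untouched).
    """
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream the bytes so large datasets do not need to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```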

Liquid Interactive Award of $1000 for the best overall result for Assignment 5.
    The winner agrees to participate in a public relations event to promote KU's data analytics program, and to an externship at Liquid.
    I will evaluate Assignment 5 rubrics and the 4.x sections of the document, noting that the PDF of 5.2 includes the write-up of 4.2.
    We will use the grading rubrics of section 5 above as our baseline. Projects 4 & 5 must meet deadlines to qualify for the award.
    Tie breaking activities include (with equal weight):
        I. Performing original analysis on a dataset that you formulated from raw data sources using your own ideas.
            If no tied-for-grade project does that, then scripting your own data cleaning in Python or R is a tie breaker.
        II. Discovering significantly more than your initial goal, where significant is roughly twice as much as the goal.
        III. Overall ambition, i.e., not just going with the simplest project that will get the grade.
        IV. Conception and creation of a project that could be carried forward into a graduate research project or a commercial application.
    You do not need to perform tie breaking activities to get 100% on assignment 5. These are just for breaking assignment 5 grade ties.

Dr. Parson: Bash Shell Scripting for Data Science, for the first hour of November 29.

    I plan to implement data-project-specific components as plug-ins to a Python framework.
        https://scikit-learn.org/stable/
        1. Load training data and test data.
        2. Preprocess -> filter data.
        3. Classify or regress or cluster using multiple alternative machine learning libraries.
        4. Extract & report results in common format, optionally generating graphics.
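
The four steps above can be sketched roughly as follows. This is only an illustration of the plug-in idea under my own assumptions, not the actual framework: the names run_pipeline, min_max_scale, MajorityClass, and OneNearestNeighbor are hypothetical, and the two tiny models are pure-Python stand-ins. They expose fit(X, y) and predict(X), which is the convention scikit-learn estimators follow, so a scikit-learn classifier could in principle be passed in the same dictionary.

```python
from collections import Counter


def min_max_scale(train_X, test_X):
    """Step 2 plug-in: scale each column to [0, 1] using training ranges."""
    cols = list(zip(*train_X))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid divide by zero

    def scale(X):
        return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]
                for row in X]

    return scale(train_X), scale(test_X)


class MajorityClass:
    """Step 3 plug-in: baseline that always predicts the most common label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class OneNearestNeighbor:
    """Step 3 plug-in: predict the label of the closest training instance."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [self.y[min(range(len(self.X)),
                           key=lambda i: dist(row, self.X[i]))]
                for row in X]


def run_pipeline(train, test, preprocess, models):
    """Run steps 1-4 and return one result dict per model."""
    # Step 1: load (here, just split (features, label) pairs).
    train_X, train_y = [f for f, _ in train], [lab for _, lab in train]
    test_X, test_y = [f for f, _ in test], [lab for _, lab in test]
    # Step 2: preprocess / filter.
    train_X, test_X = preprocess(train_X, test_X)
    # Steps 3 and 4: fit each alternative model and report in a common format.
    report = []
    for name, model in models.items():
        predicted = model.fit(train_X, train_y).predict(test_X)
        correct = sum(p == t for p, t in zip(predicted, test_y))
        report.append({"model": name, "accuracy": correct / len(test_y)})
    return report


# Hypothetical toy data: ([feature values], label).
TRAIN = [([1.0, 10.0], "low"), ([2.0, 20.0], "low"), ([8.0, 80.0], "high")]
TEST = [([1.2, 12.0], "low"), ([9.0, 90.0], "high")]
REPORT = run_pipeline(TRAIN, TEST, min_max_scale,
                      {"majority": MajorityClass(),
                       "1nn": OneNearestNeighbor()})
```

Keeping every model behind the same fit/predict interface is what lets step 4 report all results in one format, regardless of which library produced them.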

*****************************************************************************

Project requirements:

1. Pick a data source and a dataset that interests you. You can use an existing ARFF file or CSV file
    for which there are NO POSTED SOLUTIONS, or you can use a data source that requires you
    to clean and format the data for your tool of choice. You can use Weka, KNIME, RapidMiner,
    or another open source tool that I can run. I must approve your tool choice if it is not Weka, and
    you must include a link to your data source and any related documentation. Using a data source
    for which there are posted solutions, or absence of a link to your data source, earns 0%.
    Get my approval on the data source & dataset as early as possible.

Here are some data sources to consider. You can also find your own. Some of these links are unverified as of 2/2020.

50 years of Hawk Mountain data concerning raptor migration.
        In 2019 csc458 we analyzed broad-winged hawk counts for 2017 & 2018. Now we have 50 years of data.
        Dr. Goodrich of Hawk Mountain is interested in correlating climate change with raptor observations.
        This is an open-ended project that could become a CSC570 Independent Study.

Kaggle    Data competition site with lots of open problems.
        https://www.kaggle.com/

http://waterdata.usgs.gov/nwis is a U.S. Geological Survey source of data that we used in 2018 csc458 for streamflow data.
        I started looking at doing time series analysis for Dissolved Oxygen -> Chlorophyll -> pH level relationships
        in January, but didn't make much headway because the pH levels were near constant. However, I did not
        use site number as a major sort key, with time as a minor sort key, so the sites got mingled inappropriately.
        Some adventurous soul could resume that project.
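
For anyone resuming that project, the sorting fix described above (site number as the major key, time as the minor key) can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the field names site_no and datetime, the helper name per_site_series, and the sample readings are hypothetical, and timestamps are assumed to be ISO-8601 strings so that string order matches time order.

```python
from itertools import groupby
from operator import itemgetter


def per_site_series(records, value):
    """Return {site: [values in time order]} so sites are never mingled.

    Sorts with site_no as the major key and datetime as the minor key,
    then groups consecutive records that share a site.
    """
    ordered = sorted(records, key=itemgetter("site_no", "datetime"))
    return {
        site: [r[value] for r in group]
        for site, group in groupby(ordered, key=itemgetter("site_no"))
    }


# Hypothetical readings, deliberately interleaved across two sites.
READINGS = [
    {"site_no": "02", "datetime": "2018-01-02T00:00", "ph": 8.2},
    {"site_no": "01", "datetime": "2018-01-02T00:00", "ph": 7.3},
    {"site_no": "02", "datetime": "2018-01-01T00:00", "ph": 8.0},
    {"site_no": "01", "datetime": "2018-01-01T00:00", "ph": 7.1},
]
SERIES = per_site_series(READINGS, "ph")
```

Any lagged or differenced time-series attribute should then be computed within each site's list, never across the boundary between two sites.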

Chincoteague Bay Field Station
    NPS estuarine water quality data from our long-term monitoring sites throughout Sinepuxent and Chincoteague Bays should be available for direct download through the EPA-STORET database.
    I (Parson) am interested in doing some field work there in the future.
    Kutztown's page on working at this center.
    The web address for the NPS Research Permitting System is: https://irma.nps.gov/rprs.
    http://www.cbfieldstation.org/ is a related web site (not data).

Thanks to Geoffrey Pitman for forwarding the following links!

1. World Bank Open Data Datasets covering population demographics and a huge number of economic and development indicators from across the world.

2. IMF Data The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices and investments.

3. The US National Center for Education Statistics Data on educational institutions and education demographics from the US and around the world.

4. The UK Data Centre The UK’s largest collection of social, economic and population data.

5. FiveThirtyEight A large number of polls providing data on public opinion of political and sporting issues.

6. FBI Uniform Crime Reporting The FBI is responsible for compiling and publishing national crime statistics, with free data available at national, state and county level.

7. Bureau of Justice Here you can find data on law enforcement agencies, jails, parole and probation agencies and courts.

8. Qlik Data Market Offers a free package with access to datasets covering world population, currencies, development indicators and weather data.

9. NASA Exoplanet Archive Public datasets covering planets and stars gathered by NASA’s space exploration missions.

10. UN Comtrade Database Statistics compiled and published by the United Nations on international trade. Includes Comtrade Lab which is a showcase of how cutting edge analytics and tools are used to extract value from the data.

11. Financial Times Market Data Up to date information on financial markets from around the world, including stock price indexes, commodities and foreign exchange.

12. Google Trends Examine and analyze data on internet search activity and trending news stories around the world.

13. Twitter The advantage Twitter has over the others is that most conversations are public. This means that huge amounts of data are available through their API on who is talking about what, where, when and why.

14. Google Scholar Entire texts of academic papers, journals, books and legal case law.

15. Instagram As with Twitter, Instagram posts and conversations are public by default. Their APIs allow likes, mentions and business details to be analyzed.

16. OpenCorporates The world’s largest open database of companies.

17. Glassdoor API Information about job vacancies, candidates, salaries and employee satisfaction is available through their developer API.

18. IMDB Datasets Datasets in a number of formats drawn from the web’s largest resource on movies, television and people working in those industries.

19. OpenLibrary Data Dumps Datasets on books including catalogues from libraries around the world.

20. Labelled Faces in the Wild 13,000 collated and labelled images of human faces, for use in developing applications involving facial recognition.

21. Microsoft Marco Microsoft’s open machine learning datasets for training systems in reading comprehension and question answering.

22. Machine Learning Dataset Repository Collection of open datasets contributed by data scientists involved in machine learning projects.

23. eBay Market Data Insights Data on millions of online sales and auctions from eBay.

24. Natural History Museum Data Portal Information on nearly 4 million historical specimens in the London museum’s collection, as well as scientific sound recordings of the natural world.

25. CERN Open Data More than one petabyte of data from particle physics experiments carried out by CERN.

26. One Million Audio Cover Images Dataset hosted at archive.org covering music released around the world, for use in image processing research.

27. Complete Public Reddit Comments Corpus Over one billion public comments posted to Reddit between 2007 and 2015, for training language algorithms.

28. Microsoft Azure Data Markets Free Datasets Freely available datasets covering everything from agriculture to weather.

29. Irish Electric Vehicle Charge Point Status Collates data from the body which oversees the network of EV charge points across the Republic of Ireland and Northern Ireland.

30. LondonAir Pollution and air quality data from across London.

2. Analyze that dataset using techniques we have learned this semester.
    You must find at least one pattern in the data that does not use tagged attributes for non-target attributes.
    You can use tagged attributes as non-target attributes during the initial phase as we have for assignment 3.
    You must eliminate tagged attributes except for the target attribute for the final, data-based analysis.
    The target attribute may be a tagged attribute (likely). This is up to you.
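
If you script your data preparation, removing tagged attributes for the final phase might look like the minimal Python sketch below. It assumes you maintain your own list of which attribute names are tagged; the helper name drop_tagged and the attribute names in the example are hypothetical. In Weka, an attribute-removal filter serves the same purpose.

```python
def drop_tagged(rows, tagged, target):
    """Return copies of `rows` without the tagged attributes.

    The target attribute is always kept, even if it is itself tagged.
    """
    remove = set(tagged) - {target}
    return [{k: v for k, v in row.items() if k not in remove}
            for row in rows]


# Hypothetical instance: "tagged_by" and "label" are tagged, "label" is target.
ROWS = [{"rms": 0.5, "tagged_by": "student", "label": "yes"}]
FINAL = drop_tagged(ROWS, ["tagged_by", "label"], "label")
```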

3. Document your analysis stages and results similar in format to my solution handouts. Document your work.
    This must be a PDF paper. You do not need to use the Q1 .. Qn format, although you can.
    I want to see what you tried.
    I want to see what approaches worked, why they worked, and what they found.
    This is the main point. Find at least one non-obvious pattern / correlation, and explain why it is significant.
    I want to see what approaches did not work, and why. You get credit for trying things.
    Include a summary of how your findings would be relevant in an industrial or research application of data analytics.
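
As one simple way to quantify a candidate correlation between two numeric attributes, here is a minimal Python sketch of the Pearson correlation coefficient. It is only a starting point, not a required method: the function name pearson_r is hypothetical, it assumes equal-length lists with no missing values, and it assumes neither attribute is constant (a constant attribute has zero variance, which would cause a division by zero).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Returns a value in [-1, 1]; values near 0 suggest no linear relationship.
    Assumes neither list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A high or low coefficient by itself does not make a pattern significant. The write-up still needs to explain why the relationship matters for the application.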

4. Make a 15-minute presentation summarizing the above in the December 6 or 13 class. The 15 minutes include Q&A and setup/teardown time.
    You can do this in class or via Zoom.
    I will flip slides if you do it via Zoom.

5. You must submit the above PDF doc (3), your slides (4), and your COMPRESSED dataset to me via D2L by end of December 5.
    You can turn in a text file containing the URL of your slides if they are on-line.
    10% penalty for anything missing or late.
    Send me everything via acad.