*************** This part added 4/23/2018 in response to a
question about Assignment 4:
A student asked a question about the format/depth of Assignment 4.
Here is my answer, with some editing and additional perspective.
Please write complete sentences and paragraphs, and supply
some illustrations if appropriate, somewhat similar to an
analysis/design document for a software engineering course.
This should be a document that you could send on to colleagues
who were going to implement your plan; at Bell Labs we would
call that a Technical Memorandum. Think back to the undergrad
Software Engineering I course: this should be a document you'd
be ready to publish to colleagues. Use the section 4 outline
below, and use my writing guidelines.
Write this up as though you were asking someone to fund
a contracted data analysis project. This is a
professional-quality technical document, not just an
outline; my outline is just that, a structural starting
point. Write a meaty technical document that fleshes out
everything in 4.2. This document counts as much as the final
deliverable. While I have not set a page requirement, the
number of pages should be commensurate with the complexity
and importance of the task.
*****************************************************************************
I am loosening the "there must have been no prior analysis"
constraint somewhat. If there is previous analysis available,
then the student can do additional analysis using different
(new) approaches and use the results to confirm or refute parts of
the original analysis. Also, this approach of extending existing
analysis will not disqualify the student from the Liquid prize,
although starting analysis from scratch does earn some points
for the prize. See more on the Liquid prize below in this
section.
Assignment 4, due April 5:
Identify a dataset and goal for the project, obtain the data,
check and clean it as necessary, and document your goals, your
steps, and the relevance of the project to commercial or
research application. Dr. Parson must approve the dataset.
Identify in your documentation whether this is a fresh dataset
with no prior analysis, or whether you are extending existing
analysis, confirming or refuting parts of that analysis by using
data modeling techniques not used in the original analysis. Data
cleaning may just be a matter of converting a
comma-separated-value (CSV) file into ARFF format and using
AddExpression or similar filters to create derived attributes.
You may use a tool other than Weka. Project file ~parson/DataMine/csc558spring2020assn4.zip
on acad includes the make turnitin mechanism for turning
in assignment 4 by the end of April 5. Deliverables:
4.1 An ARFF file or other tool-specific data file with cleaned
data and ALL anticipated derived attributes.
"Derived attributes" may be created using a script (such as
Python), Weka's AddExpression filter, or similar; see the Python
sketch following this deliverables list.
4.2 A PDF file that documents
the following items. Use these section numbers.
4.2.a
What is the source of your data? Include any links or references
to the data source.
You can include the raw data with make
turnitin if it is not too big for your account. A URL is
good enough.
4.2.b
What is your intended goal in analyzing this data set? Are you
extending previous analysis or starting new analysis?
4.2.c
What steps have you taken so far to get the dataset to its state
for 4.1 above? What problems did you encounter?
4.2.d
How could the results of the analysis be used in a commercial or
research setting?
4.2.e
What machine learning / modeling techniques do you anticipate
using? Nominal classification, numeric estimation, other?
How do the planned modeling techniques relate to 4.2.d?
Identify the modeling tool or tools you plan to use.
4.2.f
Document any other aspect of the project that you feel is
important to communicate.
4.2.g
Use clear, descriptive writing. Use
my writing guidelines 1-7. Get someone to proofread your
doc (not me).
Counting 4.1 and 4.2.a through 4.2.g, there
are 8 rubrics, each worth 12.5% of the assignment 4 grade.
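As one illustration of the cleaning and derivation that 4.1
expects, here is a minimal Python sketch that converts a CSV file
to ARFF and adds one derived attribute along the way. The file
names and the numeric columns temp_c and wind_kph are hypothetical
placeholders for whatever your dataset contains; inside Weka, the
AddExpression filter performs the same kind of derivation.

    # csv2arff.py -- minimal sketch of CSV-to-ARFF conversion with one
    # derived attribute. raw.csv and its columns temp_c and wind_kph
    # are hypothetical; adapt the names and types to your own data.
    import csv

    IN_CSV, OUT_ARFF = "raw.csv", "cleaned.arff"

    with open(IN_CSV, newline="") as infile:
        rows = list(csv.DictReader(infile))

    # Derived attribute, analogous to Weka's AddExpression filter.
    for row in rows:
        row["wind_chill_proxy"] = float(row["temp_c"]) - 0.1 * float(row["wind_kph"])

    attributes = list(rows[0].keys())
    with open(OUT_ARFF, "w") as outfile:
        outfile.write("@relation cleaned\n\n")
        for attr in attributes:
            # Everything is declared numeric here; declare nominal
            # attributes by hand as @attribute name {value1,value2,...}
            outfile.write("@attribute %s numeric\n" % attr)
        outfile.write("\n@data\n")
        for row in rows:
            outfile.write(",".join(str(row[a]) for a in attributes) + "\n")

This sketch assumes every column is numeric; a real cleaning script
would also handle missing values ('?' in ARFF) and nominal
attribute declarations.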
Assignment 5, as described below with the current changes in
this section, is due by the end of April 26.
Project file ~parson/DataMine/csc558spring2020final.zip
on acad includes the make turnitin mechanism for turning
in assignment 5 by the end of April 26. Deliverables:
5.1 ARFF file(s) or other tool-specific
data file(s) as modified during this analysis phase.
5.2 A PDF file that documents the
following items. Use your original document and add sections
to it using these section numbers.
5.2.a What additional data did you collect during analysis, if
any? Include any links or references to the data source.
You can include the new raw data with make turnitin if
it is not too big for your account. A URL is good enough.
5.2.b Did you achieve your intended goal in analyzing this data
set? Explain how the analysis shows goal achievement or refutation.
Include classification results and explain how they achieve,
refute, or otherwise relate to your goals.
5.2.c What machine learning/modeling steps have you taken?
Show classification/regression results. Show the filtering used.
Explain them at the level of detail that I use in the
assignment 1-3 solutions. What problems did you encounter?
5.2.d Use SMO, SMOreg, MultilayerPerceptron, or clustering, or
at least one other technique not used in assignments 1-3.
Give results and explain how this step relates to earlier steps
(see the command-line sketch after 5.3).
5.2.e Revise your explanation of how the results of the analysis
could be used in a commercial or research setting.
5.2.f Document any other aspect of the project that you feel is
important to communicate.
5.2.g is for clear, descriptive writing. Use my writing
guidelines 1-7. Get someone to proofread your doc (not me).
5.3 A clear, 15-minute-bound presentation to the class on
April 28 or May 5, as explained below.
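For 5.2.d, if you drive Weka from a script rather than the
Explorer GUI, a run might look like the following sketch. It
assumes weka.jar sits in the working directory and that
cleaned.arff is your 5.1 data file; both are placeholders, and
SMOreg or MultilayerPerceptron substitute for SMO in the same way.

    # run_smo.py -- sketch of invoking Weka's SMO classifier from Python
    # through Weka's command-line interface. The weka.jar location and
    # the ARFF file name are assumptions; adjust for your setup.
    import subprocess

    cmd = [
        "java", "-cp", "weka.jar",
        "weka.classifiers.functions.SMO",  # or SMOreg, MultilayerPerceptron, ...
        "-t", "cleaned.arff",              # training data from deliverable 5.1
        "-x", "10",                        # 10-fold cross-validation
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)  # accuracy, kappa, confusion matrix, etc.

Paste the relevant parts of that output into 5.2.c and 5.2.d, and
explain them at the level of detail the assignment 1-3 solutions use.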
Project 5 grading rubrics:
The 5.3 presentation counts for 20% of
this project. Make it clear. Be ready to answer questions.
If in the classroom,
repeat the questions for RTVC students before answering them.
The remaining 80% distributes as 10% each for 5.1 and
5.2.a through 5.2.g.
Liquid Interactive Award of $1000 for the best overall result
for Assignment 5. The winner agrees to participate in a public
relations event to promote KU's data analytics program, and to
an externship at Liquid.
I will evaluate Assignment 5 rubrics and the
4.x sections of the document, noting that the PDF of 5.2
includes the write-up of 4.2.
We will use the grading rubrics of section 5
above as our baseline. Projects 4 & 5 must meet deadlines
to qualify for the award.
Tie breaking activities include (with
equal weight):
I. Performing original
analysis on a dataset that you formulated from raw data sources
using your own ideas.
If no
tied-for-grade project does that, then scripting your own data
cleaning in Python or R is a tie breaker.
II. Discovering significantly more than your initial goal, where
significant means roughly twice as much as the goal.
III. Overall ambition,
i.e., not just going with the simplest project that will get the
grade.
IV. Conception and
creation of a project that could be carried forward into a
graduate research project or a commercial application.
You do not need to perform tie breaking activities to get 100%
on assignment 5. These are just for breaking assignment 5 grade ties.
Student topics:
Bryan McNally & Tyler Blankenbiller: Dr. Goodrich's time series on weather changes over decades & correlations to migration times.
Faith Evans: Time-based visualization of North Lookout.
Manny Douge: Covid-19 dataset hosted by Johns Hopkins.
Patrick Earl: May also analyze a different angle of the Covid-19 debacle.
Angela Kozma: Flavors of Cacao.
Sean Kinneer: Factors in mass mortality among animals.
Brandon Kresge: Rain in Australia Kaggle dataset.
Shaan Badlu: Divorce rate factors or a mental health dataset.
Jishitha Nambiar: Netflix dataset.
Kelly Fox: "World Happiness Report: Happiness scored according to economic production, social support, etc." from Kaggle.
Vivian Azar: NYC AirBnb data from Kaggle.
Eric Burgos: Big Five Personality Test OR Residential Energy Consumption Survey (both Kaggle).
Nicholas Paolella: Asteroid dataset.
Lori Bogumil: Top 50 Spotify Songs of 2019.
Noah Cregger: Music Features dataset, source of data.
Tamara Jennings: Patterns in Major League Baseball, with data from here, and from here, and possibly a related dataset.
Christopher Kelly: University rankings and alumni income.
*****************************************************************************
Project requirements:
1. Pick a data source and a dataset that interests you.
You can use an existing ARFF file or CSV file for which there
are NO POSTED SOLUTIONS, or you can use a data source that
requires you to clean and format the data for your tool of
choice. You can use Weka, KNIME, RapidMiner, or another open
source tool that I can run. I must approve your tool choice if
it is not Weka, and you must include a link to your data source
and any related documentation. Using a data source for which
there are posted solutions, or the absence of a link to your
data source, earns 0%. Get my approval on the data source
& dataset as early as possible.
Here are some data sources to consider. You can also find your
own. Some of these links are unverified as of 2/2020.
50 years of Hawk Mountain data concerning raptor migration.
Last semester in csc458 we
analyzed broad wing hawk counts for 2017 & 2018. Now we have
50 years of data.
One project could be
student-selected analysis of hawk counts or weather patterns.
Dr. Laurie Goodrich wrote
(Time-based
analysis of North Lookout):
A. "I WOULD SAY BROADWINGS MAY BE THE HARDEST SPECIES TO WORK
WITH AS THEY ARE SO UNPREDICTABLE.
BUT IF YOU WANTED TO HAVE CLASS LOOK AT A MORE
PREDICTABLE ONE .. SSHA MIGHT BE GOOD.
ANOTHER ASPECT IS HOW WEATHER HAS CHANGED OVER
TIME. NOT LOOKING AT THAT, BUT KNOWING THAT.
ARE BIRDS
CONCENTRATING IN DIFFERENT PATTERNS NOW THAN 30 YRS AGO", also:
" I would love to have someone look at how visibility has
changed over time here, or other factors such as wind direction
and speed."
B. "The
peak season for BW is pretty well known to be between sept 10
and sept 24 or so. Our peak days are often 15, 16, 17.
Other
species might be more useful to predict especially since they
seem to be shifting. Golden Eagle, Redtailed Hawk,
Sharpshinned
Hawk, Peregrine Falcon. American Kestrel, Osprey."
C. I also asked about getting data from sites to the north, so
we could correlate weather between there & Hawk Mountain
as a factor in estimating arrival times. She said getting
reliable data is hard.
D. Time-based visualization of North Lookout.
A different project could be time-based visualization of
species-highlighted migrations plus weather conditions
(clouds, wind speed & direction, etc.) over a map of North
Lookout over time. There is a possibility of getting an
interactive visualization installed at Hawk Mountain. Adding
interactivity could be a CSC570 Independent Study for a student
in fall 2020.
Kaggle: a data competition site with lots of open problems.
https://www.kaggle.com/
http://waterdata.usgs.gov/nwis
is a U.S. Geological Survey source of data that we used
in 2018 csc458 for streamflow data.
I started looking at time series analysis of Dissolved Oxygen ->
Chlorophyll -> pH level relationships in January, but didn't
make much headway because the pH levels were near constant.
However, I did not use site number as a major sort key, with
time as a minor sort key, so the sites got mingled
inappropriately. Some adventurous soul could resume that
project; see the sorting sketch below.
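Here is a sketch of that sort-key fix in pandas. The file name
and the column names site_no and datetime are hypothetical; match
them to the columns in the NWIS export you actually download.

    # Sort USGS readings so each monitoring site forms a contiguous,
    # chronologically ordered series (site = major key, time = minor key).
    import pandas as pd

    df = pd.read_csv("nwis_export.csv", parse_dates=["datetime"])
    df = df.sort_values(["site_no", "datetime"]).reset_index(drop=True)
    df.to_csv("nwis_sorted.csv", index=False)

With the sites separated this way, per-site time series models no
longer mingle readings from different locations.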
Chincoteague Bay Field Station
NPS estuarine water quality data from our
long-term monitoring sites throughout Sinepuxent and
Chincoteague Bays should be available for direct download
through the EPA-STORET database.
I (Parson) am interested in doing some field
work there in the future.
Kutztown's
page on working at this center.
The web address for the NPS Research
Permitting System is: https://irma.nps.gov/rprs.
http://www.cbfieldstation.org/
is a related web site (not data).
Thanks to Geoffrey Pitman
for forwarding the following links!
1. World Bank Open Data: Datasets covering population
demographics and a huge number of economic and development
indicators from across the world.
2. IMF Data: The International Monetary Fund publishes data on
international finances, debt rates, foreign exchange reserves,
commodity prices and investments.
4. The UK Data Centre: The UK's largest collection of social,
economic and population data.
5. FiveThirtyEight: A large number of polls providing data on
public opinion of political and sporting issues.
6. FBI Uniform Crime Reporting: The FBI is responsible for
compiling and publishing national crime statistics, with free
data available at national, state and county level.
7. Bureau of Justice: Here you can find data on law enforcement
agencies, jails, parole and probation agencies and courts.
8. Qlik Data Market: Offers a free package with access to
datasets covering world population, currencies, development
indicators and weather data.
9. NASA Exoplanet Archive: Public datasets covering planets and
stars gathered by NASA's space exploration missions.
10. UN Comtrade Database: Statistics compiled and published by
the United Nations on international trade. Includes Comtrade
Lab, a showcase of how cutting-edge analytics and tools are
used to extract value from the data.
11. Financial Times Market Data: Up-to-date information on
financial markets from around the world, including stock price
indexes, commodities and foreign exchange.
12. Google Trends: Examine and analyze data on internet search
activity and trending news stories around the world.
13. Twitter: The advantage Twitter has over the others is that
most conversations are public. This means that huge amounts of
data are available through their API on who is talking about
what, where, when and why.
14. Google Scholar: Entire texts of academic papers, journals,
books and legal case law.
15. Instagram: As with Twitter, Instagram posts and
conversations are public by default. Their APIs allow likes,
mentions and business details to be analyzed.
17. Glassdoor API: Information about job vacancies, candidates,
salaries and employee satisfaction is available through their
developer API.
18. IMDB Datasets: Datasets in a number of formats drawn from
the web's largest resource on movies, television and people
working in those industries.
20. Labelled Faces in the Wild: 13,000 collated and labelled
images of human faces, for use in developing applications
involving facial recognition.
21. Microsoft MS MARCO: Microsoft's open machine learning
datasets for training systems in reading comprehension and
question answering.
24. Natural History Museum Data Portal: Information on nearly
4 million historical specimens in the London museum's
collection, as well as scientific sound recordings of the
natural world.
25. CERN Open Data: More than one petabyte of data from
particle physics experiments carried out by CERN.
30. LondonAir: Pollution and air quality data from across London.
2. Analyze that dataset using techniques we have learned this
semester.
You must find at least one pattern in the data that does not
use tagged attributes as non-target attributes.
You can use tagged attributes as non-target attributes during
the initial phase, as we did for assignment 3.
You must eliminate tagged attributes, except for the target
attribute, for the final, data-based analysis; a filtering
sketch follows this requirement.
The target attribute may be a tagged attribute (likely). This
is up to you.
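Here is a sketch of that tag-stripping step, driving Weka's
unsupervised Remove filter from Python. The attribute indices and
file names are placeholders; keep your target attribute out of
the -R list, and note that you could equally drop the columns in
pandas or R.

    # strip_tags.py -- remove tagged, non-target attributes before the
    # final, data-based analysis. Indices 2 and 5 and the file names
    # are hypothetical examples only.
    import subprocess

    subprocess.run([
        "java", "-cp", "weka.jar",
        "weka.filters.unsupervised.attribute.Remove",
        "-R", "2,5",                  # 1-based indices of tagged attributes to drop
        "-i", "with_tags.arff",       # initial-phase file with tagged attributes
        "-o", "final_untagged.arff",  # input to the final analysis
    ], check=True)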
3. Document your analysis stages and results in a format
similar to my solution handouts. Document your work.
This must be a PDF paper. You do not need to use the
Q1 .. Qn format, although you can.
I want to see what you tried.
I want to see what approaches worked, why
they worked, and what they found.
This is the main point.
Find at least one non-obvious pattern
/ correlation, and explain why it is significant.
I want to see what approaches did not work,
and why. You get credit for trying things.
Include a summary of how your
findings would be relevant in an industrial or research
application of data analytics.
4. Make a 15-minute presentation in the April 28 or May 5
class -- the 15 minutes include Q&A and setup/tear-down time,
summarizing the above.
You can do this in class or via Zoom. I will flip slides if
you do it via Zoom.
5. You must have the above PDF doc (3), your slides (4),
and your dataset to me via
acad by end of April 26.
Also, use the D2L Assignment for CSC558 to
turn in just your slides by end of April 26.
You can turn in a text file containing the
URL of your slides if they are on-line.
10% penalty for anything missing or late.
I am supplying
~parson/DataMine/csc558spring2020final.zip with a makefile that
supports make turnitin.
Do NOT send anything other than slides
via D2L.
Send me everything via acad.