*************** Some Q&A about Assignment 4:
A student asked about the format and depth of assignment 4.
Here is my answer, with some editing and additional
perspective.
Please write complete sentences and paragraphs, and supply
some illustrations if appropriate, somewhat as you would
when writing an analysis/design doc for a software
engineering course. This should be a document that you might
send on to colleagues who were going to implement your plan;
at Bell Labs we'd call that a Technical Memorandum. Think
back to the undergrad Software Engineering I course: this
should be a document you'd be ready to publish to
colleagues. Use the section 4 outline below, and use my
writing guidelines.
Write this up as though you were asking someone to fund
a contracted data analysis project. This is a
professional-quality technical document, not just an
outline; my outline is just that, a structural starting
point. Write a meaty technical document that fleshes out
everything in 4.2. This doc counts as much as the final
deliverable. While I have not set a page requirement, the
number of pages should be commensurate with the complexity
and importance of the task.
*****************************************************************************
I am loosening the "there must have been no prior analysis"
constraint somewhat. If there is previous analysis available,
then the student can do additional analysis using different
(new) approaches and use results to confirm or refute parts of
the original analysis. Also, this approach of extending existing
analysis will not disqualify the student from the Liquid prize,
although starting analysis from scratch does earn some points
for the prize. See more on the Liquid prize below in this
section.
Assignment 4 due ???: Identify
a dataset and goal for the project, obtain the data, check and
clean it as necessary, and document your goals, your steps, and
the relevance of the project to commercial or research
application. Dr. Parson must approve the dataset. Identify in
your documentation whether this is a fresh dataset with no prior
analysis, or whether you are extending existing analysis,
confirming or refuting parts of that analysis by using data
modeling techniques not used in the original analysis. Data
cleaning may just be a matter of converting a
comma-separated-value (CSV) file into ARFF format and using
AddExpression or similar filters to create derived attributes.
You may use a tool other than Weka. Use D2L Assessment ->
Assignment 4 for turning in assignment 4 by the end of November
26. Deliverables:
4.1 An ARFF file or other
tool-specific data file with cleaned data and ALL
anticipated derived attributes.
"Derived attributes" may be extracted using a script (such as
Python), Weka AddExpression, or similar.
4.2 A PDF file that documents
the following items. Use these section numbers.
4.2.a
What is the source of your data? Include any links or references
to the data source.
You can include the raw data if it is not too
big for D2L. A URL in the write-up is good enough.
4.2.b
What is your intended goal in analyzing this data set? Are you
extending previous analysis or starting new analysis?
4.2.c
What steps have you taken so far to get the dataset to its state
for 4.1 above? What problems did you encounter?
4.2.d
How could the results of the analysis be used in a commercial or
research setting?
4.2.e
What machine learning / modeling techniques do you anticipate
using? Nominal classification, numeric estimation, other?
How
do(es) the planned modeling technique(s) relate to 4.2.d?
Identify the modeling tool or tools you plan to use.
4.2.f
Document any other aspect of the project that you feel is
important to communicate.
4.2.g
Use clear, descriptive writing. Use
my writing guidelines 1-7. Get someone to proofread your
doc (not me).
Counting 4.1 and 4.2.a through 4.2.g, there
are 8 rubrics, each worth 12.5% of the assignment 4 grade.
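For students scripting their own data cleaning for 4.1, here
is a minimal sketch, assuming Python with pandas installed
and a hypothetical input file raptors.csv containing a
numeric temperature_f column; the file and column names are
illustrative assumptions, not part of the assignment. It
converts CSV to a bare-bones ARFF file and adds one derived
attribute, analogous to Weka's AddExpression filter.

    # Minimal sketch: CSV -> derived attribute -> simple ARFF.
    # The file name raptors.csv and its columns are hypothetical;
    # this version assumes all attributes are numeric.
    import pandas as pd

    df = pd.read_csv("raptors.csv")

    # Derived attribute, analogous to Weka's AddExpression:
    # Fahrenheit to Celsius.
    df["temperature_c"] = (df["temperature_f"] - 32.0) * 5.0 / 9.0

    with open("raptors.arff", "w") as arff:
        arff.write("@relation raptors\n\n")
        for name in df.columns:
            arff.write("@attribute %s numeric\n" % name)
        arff.write("\n@data\n")
        df.to_csv(arff, header=False, index=False)

Nominal attributes would need @attribute declarations listing
their values; Weka's own CSVLoader can also do this conversion.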
Assignment 5, as described below with the current changes in
this section, is due by ???.
Deliverables:
5.1 COMPRESSED ARFF file(s) or other
tool-specific data file(s) as modified during this
analysis phase.
5.2 A PDF file that documents the
following items. Use your original document and add sections
to it using these section numbers.
5.2.a What additional data did you collect during analysis, if
any? Include any links or references to the data source.
You can upload compressed data files to D2L if it accepts them;
a URL is good enough.
5.2.b Did you achieve your intended goal in analyzing this data
set? Explain how analysis shows goal achievement or refutation.
Include classification
results and explain how they achieve, refute, or otherwise
relate to your goals.
5.2.c What machine learning/modeling steps have you taken?
Show classification / regression results. Show the filtering
used. Explain them at the level of detail that I use in the
assignment 1-3 solutions. What problems did you encounter?
5.2.d
Use SMO, SMOreg, MultiLayerPerceptron, clustering, or at
least one other technique not used in assignments 1-3 (see
the scikit-learn sketch after 5.3 below). Give results and
explain how this step relates to earlier steps.
5.2.e Revise your explanation of how the results of the
analysis could be used in a commercial or research setting.
5.2.f Document any other aspect of the project that you feel is
important to communicate.
5.2.g is for clear, descriptive writing. Use my writing
guidelines 1-7. Get someone to proofread your doc (not me).
5.3 A clear presentation of at most 15 minutes to
the class on December 6 or 13, as
previously explained below.
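For students working outside Weka, here is a minimal sketch
of scikit-learn analogues to the 5.2.d techniques, assuming
the 5.1 data has already been loaded into a numeric feature
matrix X and a target y (loading and encoding are omitted).
SVC and SVR are support-vector analogues of Weka's SMO and
SMOreg (the underlying training algorithms differ),
MLPClassifier corresponds to MultiLayerPerceptron, and KMeans
is one clustering option. This is an illustration, not a
required implementation.

    # Sketch of scikit-learn analogues to the 5.2.d techniques.
    # Assumes X (numeric features) and y (target) are already
    # loaded, e.g. from the 5.1 ARFF file via scipy.io.arff.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC, SVR
    from sklearn.neural_network import MLPClassifier
    from sklearn.cluster import KMeans

    # Nominal classification, analogous to Weka SMO:
    print("SVC accuracy:", cross_val_score(SVC(), X, y, cv=10).mean())

    # Analogous to Weka MultiLayerPerceptron:
    print("MLP accuracy:",
          cross_val_score(MLPClassifier(max_iter=1000), X, y, cv=10).mean())

    # Numeric estimation, analogous to Weka SMOreg, would use
    # SVR() with a numeric target in place of a nominal y.

    # Clustering is unsupervised; no target attribute is used.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)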
Project 5 grading rubrics:
The 5.3 presentation counts for 20% of
this project. Make it clear. Be ready to answer questions.
If in the classroom,
repeat the questions for RTVC students before answering them.
The remaining 80% is distributed as 10
points each for 5.1 and 5.2.a through 5.2.g.
Liquid Interactive
Award of $1000 for the best overall result for Assignment
5.
The winner agrees to participate in a public
relations event to promote KU's data analytics program, and to
an externship at Liquid.
I will evaluate Assignment 5 rubrics and the
4.x sections of the document, noting that the PDF of 5.2
includes the write-up of 4.2.
We will use the grading rubrics of section 5
above as our baseline. Projects 4 & 5 must meet deadlines
to qualify for the award.
Tie breaking activities include (with
equal weight):
I. Performing original
analysis on a dataset that you formulated from raw data sources
using your own ideas.
If no
tied-for-grade project does that, then scripting your own data
cleaning in Python or R is a tie breaker.
II. Discovering
significantly more than your initial goal, where
"significantly more" means roughly twice as much as the goal.
III. Overall ambition,
i.e., not just going with the simplest project that will get the
grade.
IV. Conception and
creation of a project that could be carried forward into a
graduate research project or a commercial application.
You do not need to perform tie-breaking activities to get
100% on assignment 5. These are just for breaking assignment
5 grade ties.
Dr. Parson
Bash Shell Scripting for Data Science, for the first hour of
class on November 29.
I plan to make data-project-specific components into plugins
for a Python framework (a hypothetical skeleton follows the
four steps below).
https://scikit-learn.org/stable/
1. Load training data and test data.
2. Preprocess -> filter data.
3. Classify or regress or cluster using
multiple alternative machine learning libraries.
4. Extract & report results in
common format, optionally generating graphics.
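Here is a minimal sketch of what such a plugin framework
might look like; the names DataPlugin and run_pipeline are
illustrative, not an existing API, and a real version would
dispatch step 3 to Weka, scikit-learn, or other libraries.

    # Hypothetical skeleton for the four-step pipeline above.
    class DataPlugin:
        """A project-specific plugin supplies these four steps."""
        def load(self):              # 1. load training and test data
            raise NotImplementedError
        def preprocess(self, data):  # 2. filter / derive attributes
            return data
        def model(self, data):       # 3. classify, regress, or cluster
            raise NotImplementedError
        def report(self, results):   # 4. report in a common format
            print(results)

    def run_pipeline(plugin):
        data = plugin.preprocess(plugin.load())
        plugin.report(plugin.model(data))

A concrete project would subclass DataPlugin, override load()
and model(), and call run_pipeline() on an instance.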
*****************************************************************************
Project requirements:
1. Pick a data source and a dataset that interests you.
You can use an existing ARFF file or CSV file
for which there are NO POSTED SOLUTIONS, or
you can use a data source that requires you
to clean and format the data for your tool of
choice. You can use Weka, KNIME, RapidMiner,
or another open source tool that I can run. I
must approve your tool choice if it is not Weka, and
you must include a link to your data source and
any related documentation. Using a data source
for which there are posted solutions, or
omitting a link to your data source, earns 0%.
Get my approval on the data source
& dataset as early as possible.
Here are some data sources to consider. You can also find your
own. Some of these links are unverified as of 2/2020.
50 years of Hawk Mountain data concerning raptor migration.
In 2019 csc458 we analyzed broad-winged hawk counts for
2017 & 2018. Now we have 50 years of data.
Dr. Goodrich of Hawk Mountain is interested in correlating
climate change with raptor observations.
This is an open-ended project that could become a CSC570
Independent Study.
Kaggle Data competition site with lots of
open problems.
https://www.kaggle.com/
http://waterdata.usgs.gov/nwis
is a U.S. Geological Survey source of data that we used
in 2018 csc458 for streamflow data.
I started looking at doing time series analysis for
Dissolved Oxygen -> Chlorophyll -> pH level relationships
in January, but didn't make much headway because the pH
levels were near constant. However, I did not use site
number as a major sort key with time as a minor sort key,
so readings from different sites got mingled inappropriately
(see the sorting sketch after this list of data sources).
Some adventurous soul could resume that project.
Chincoteague Bay Field Station
NPS estuarine water quality data from our
long-term monitoring sites throughout Sinepuxent and
Chincoteague Bays should be available for direct download
through the EPA-STORET database.
I (Parson) am interested in doing some field
work there in the future.
Kutztown's
page on working at this center.
The web address for the NPS Research
Permitting System is: https://irma.nps.gov/rprs.
http://www.cbfieldstation.org/
is a related web site (not data).
Thanks to Geoffrey Pitman
for forwarding the following links!
1. World Bank Open Data Datasets
covering population demographics and a huge number of
economic and development indicators from across the world.
2. IMF Data The
International Monetary Fund publishes data on
international finances, debt rates, foreign exchange
reserves, commodity prices and investments.
4. The UK Data Centre The
UK’s largest collection of social, economic and population
data.
5. FiveThirtyEight A
large number of polls providing data on public opinion of
political and sporting issues.
6. FBI Uniform Crime Reporting The
FBI is responsible for compiling and publishing national
crime statistics, with free data available at national,
state and county level.
7. Bureau of Justice Statistics Here
you can find data on law enforcement agencies, jails,
parole and probation agencies and courts.
8. Qlik Data Market Offers
a free package with access to datasets covering world
population, currencies, development indicators and weather
data.
9. NASA Exoplanet
Archive Public datasets covering
planets and stars gathered by NASA’s space exploration
missions.
10. UN Comtrade Database Statistics
compiled and published by the United Nations on
international trade. Includes Comtrade Lab which is a
showcase of how cutting edge analytics and tools are used
to extract value from the data.
11. Financial Times Market Data Up
to date information on financial markets from around the
world, including stock price indexes, commodities and
foreign exchange.
12. Google Trends Examine
and analyze data on internet search activity and trending
news stories around the world.
13. Twitter The
advantage Twitter has over the others is that most
conversations are public. This means that huge amounts of
data are available through their API on who is talking
about what, where, when and why.
14. Google Scholar Entire
texts of academic papers, journals, books and legal case
law.
15. Instagram As
with Twitter, Instagram posts and conversations are public
by default. Their APIs allow likes, mentions and business
details to be analyzed.
17. Glassdoor API Information
about job vacancies, candidates, salaries and employee
satisfaction is available through their developer API.
18. IMDB Datasets Datasets
in a number of formats drawn from the web’s largest
resource on movies, television and people working in those
industries.
20. Labeled Faces in the Wild 13,000
collated and labelled images of human faces, for use in
developing applications involving facial recognition.
21. Microsoft MARCO Microsoft’s
open machine learning datasets for training systems in
reading comprehension and question answering.
24. Natural History Museum Data Portal Information
on nearly 4 million historical specimens in the London
museum’s collection, as well as scientific sound
recordings of the natural world.
25. CERN Open Data More
than one petabyte of data from particle physics
experiments carried out by CERN.
30. LondonAir Pollution
and air quality data from across London.
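Regarding the USGS streamflow project above: here is the
one-step fix as a sketch, assuming pandas and hypothetical
column names site_no and datetime in the exported data.
Sorting with site number as the major key and time as the
minor key keeps each site's time series contiguous.

    # Sort with site number as the major key and time as the
    # minor key, so that sites are not mingled in the series.
    # File and column names (nwis_export.csv, site_no, datetime)
    # are assumptions about the USGS export format.
    import pandas as pd

    df = pd.read_csv("nwis_export.csv", parse_dates=["datetime"])
    df = df.sort_values(["site_no", "datetime"])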
2. Analyze that dataset using techniques we
have learned this semester.
You must find at least one pattern in the
data that does not use tagged attributes for non-target
attributes.
You can use tagged attributes as non-target
attributes during the initial phase, as we did for assignment 3.
You must eliminate tagged attributes, except
for the target attribute, for the final, data-based analysis
(see the sketch after this requirement).
The target attribute may be a tagged
attribute (it likely will be); this is up to you.
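As a sketch of that attribute elimination, assuming pandas
and purely hypothetical tagged-attribute names: keep the
tagged target, drop the other tagged attributes before the
final data-based run. Weka's Remove filter accomplishes the
same thing interactively.

    import pandas as pd

    # df holds the cleaned data from deliverable 4.1; the file
    # name and the tagged-attribute names below are
    # illustrative assumptions.
    df = pd.read_csv("cleaned_data.csv")

    tagged = ["season_tag", "migration_phase_tag", "count_class_tag"]
    target = "count_class_tag"

    # Keep the tagged target; drop all other tagged attributes
    # before the final, data-based analysis.
    final_df = df.drop(columns=[c for c in tagged if c != target])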
3. Document your analysis stages and results in a format
similar to my solution handouts. Document your work.
This must be a PDF paper. You do not need to
use the Q1 .. Qn format, although you can.
I want to see what you tried.
I want to see what approaches worked, why
they worked, and what they found.
This is the main point.
Find at least one non-obvious pattern
/ correlation, and explain why it is significant.
I want to see what approaches did not work,
and why. You get credit for trying things.
Include a summary of how your
findings would be relevant in an industrial or research
application of data analytics.
4. Make a ??-minute presentation in the ?? class,
summarizing the above; the ?? minutes include Q&A and
setup/teardown time.
You can do this in class or via Zoom; I will flip slides if
you do it via Zoom.
5. You must have the above PDF doc (3), your slides (4),
and your COMPRESSED dataset turned in to me via D2L
by the end of December 5.
You can turn in a text file containing the
URL of your slides if they are on-line.
10% penalty for anything missing or late.
Send me everything via acad.