CSC 558 - Data Mining and Predictive Analytics II, Spring 2020,
Tu 6-8:50 PM in Old Main 158.
Dr. Dale E. Parson
Students can come to class or attend live on-line via Zoom.
Please read
student instructions here.
Here is the Zoom
link for attending remotely at class time.
Zoom video archives
go at the bottom of this page.
Spring 2020 Office Hours: Tu
2:30-4:30, Wed 12:00-2:00, Fri 1:30-2:30, or by appointment.
Parson office hours for last two weeks of spring 2020:
Tuesday Apr 28 2-3 PM (changed from 2:30-4:30)
Wed Apr 29 12-2 PM as usual
Fri May 1 1:30-3:30 (extra hour added)
Final exam week hours:
Tuesday May 5 2:30-4:30 PM as usual
Wed May 6 12-1 PM (one hour less)
Thursday May 7 1:30-3:30 (Friday's usual office hour
cancelled)
Real-time on-line teaching via Zoom to commence March 23
through end of semester.
You can attend interactively at
normal class time if possible via our normal link https://kutztown.zoom.us/j/914190228.
Use the Chrome browser if possible.
Click the link before March 23 to auto-install Zoom if you
haven't already.
I will post a link to a video
recording of each class within 24 hours at the bottom of
this course page as before.
My office hours will take place at
the normal times VIA THIS
DIFFERENT OFFICE HOUR ZOOM ROOM.
Please watch your email & this
course page. If Zoom runs out of steam, I will post YouTube
videos & send email.
Deloitte is recruiting Data
Scientists (including graduate level) and Data
Analysts (undergrad only) in Mechanicsburg, PA
First day
handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools
and Techniques, Fourth Edition, Witten et al., ISBN
978-0128042915. You can buy a discounted copy of the 3rd
Edition at the KU Book Store -- either edition is fine.
There are on-line copies of the Third Edition available in
Rohrbach Library.
There are 4 courses or 3 courses + (research or internship) in
our Graduate
Data Analytics certificate program.
You need to register for the program (registration is free),
and you can use courses from a CSIT master's program.
Talk to me if you are interested.
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I
make mistakes in pronunciation, name, gender, or anything
else. Thanks! Here is a poll to which you can reply
privately on paper or via email.
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student
was a minor at the time the incident occurred.
Regardless of the student’s age, if the incident is disclosed
to me outside the classroom setting or a course assignment, I
am required by law to report the disclosure, including
relevant details, such as the names of those involved in the
incident, to Public Safety and Police Services and to Mr.
Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. During a working session you may leave
after completing and turning in all due work; you are
encouraged to stay to get additional practice and ask
questions. Thank you.
RESOURCES & HANDOUTS. We will use research results
published by students & me to discuss various topics.
Link to the Fall
2019 CSC458 prerequisite course.
Link to the Spring
2018 CSC558 offering.
Slides leading to Assignment 1 on Ensemble Learning.
Fall 2017 CSC458 slides
on evaluating numeric prediction.
A summary of the Kappa
Statistic.
A subset of Chapter
5 on Evaluation and 7 on Data Transformations.
Chapter 12 on Ensemble
Learning.
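For reference, here is a minimal Python sketch of the Kappa statistic, assuming the statistic meant in the summary above is Cohen's kappa, which compares observed classification accuracy against the agreement expected by chance. The function name cohen_kappa and the sample labels are hypothetical and are not taken from the course slides.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Chance-corrected agreement between actual and predicted class labels."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("expected two non-empty label sequences of equal length")
    n = len(actual)
    # Observed agreement: fraction of instances classified correctly.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Expected agreement: chance of a match if predictions kept the same
    # class totals but were assigned independently of the actual labels.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # only one class present: kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, for illustration only.
actual = ["a", "a", "b", "b"]
predicted = ["a", "a", "b", "a"]
print(cohen_kappa(actual, predicted))
```

With these sample labels the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. A value near 0 means the classifier does no better than chance, and 1 means perfect agreement.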
Here are textbook
slides from Kotu’s & Deshpande’s Predictive
Analytics and Data Mining: Concepts and Practice using
RapidMiner.
There is an excellent on-line video course Predictive
Analytics Training with Weka (Introduction) by one of
our textbook authors & Weka creators.
Here is our
textbook's website. We will be using version 3.8.4 of the
Weka tool set, which you can download
to your machine from here.
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
Additional
Weka documentation is here.
I will draw some material from
this textbook as well.
We may use the
Kaggle site for a project some time this semester.
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial
"Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information
Science Educators (PACISE)
Edinboro University of PA, Edinboro, PA, April 10-11, 2015.
Here are the slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators
(PACISE) Kutztown University of PA, Kutztown, PA, April 1-2,
2016. Here are the slides
from the talk.
ASSIGNMENTS
Readings related to several assignments:
Machine
Listening with Very Small Training Datasets, Wissam
Malke’s January 2017 master’s thesis.
Mapping
Data Visualization to Timbral Sonification and Machine
Listening (a spring 2017 unpublished paper by Parson
et al.)
Instance-Based
Learning Algorithms, a paper from 1991.
K*:
An Instance-based Learner Using an Entropic Distance Measure,
a paper from 1995.
Locally
Weighted Naive Bayes, a paper from 2012.
A
graph on informational entropy, which relates to building
rules, decision trees, and K*.
Audio
signal overview from spring 2020 for Assignment 1.
Assignment 1 is due
via make turnitin by 11:59 PM Wednesday February 19.
10% per day late penalty applies.
Here is the slide
on information entropy, which relates to how KStar and
many decision trees make decisions.
The link to this slide
was broken on 2/4. We will go over it on 2/11.
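As a minimal sketch of the idea behind information entropy (not a reproduction of the slide), the Python below computes Shannon entropy in bits for a list of class labels, plus the information gain of a split, which is the quantity entropy-based decision tree learners use to choose an attribute. The function names and the sample labels are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy, in bits, of the class distribution in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Sum of -p * log2(p) over each class proportion p.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction from partitioning labels into the given groups.

    groups is a list of label lists that together contain every label
    exactly once, e.g. one list per value of a nominal attribute.
    """
    n = len(labels)
    remainder = sum(len(g) / n * entropy_bits(g) for g in groups)
    return entropy_bits(labels) - remainder


# Hypothetical labels, for illustration only.
labels = ["y", "y", "n", "n"]
print(entropy_bits(labels))                                 # evenly mixed classes
print(information_gain(labels, [["y", "y"], ["n", "n"]]))   # split that separates classes
print(information_gain(labels, [["y", "n"], ["y", "n"]]))   # split that does not
```

An evenly mixed two-class list has 1 bit of entropy. A split that separates the classes completely gains that full bit, while a split that leaves each group equally mixed gains nothing.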
My
Assignment 1 answers are posted here.
Assignment 2 is due via make
turnitin by 11:59 PM Wednesday March 4. 10% per day late
penalty applies.
Run
cp ~parson/DataMine/whitenoise558sp2020/checkfiles.sh checkfiles.sh
from within your project directory before you make turnitin. This
won't affect anything other than the turn-in step.
My
Assignment 2 answers are posted here.
Added 3/29: A bash script to solve
this assignment and its output
to the terminal.
~parson/DataMine/whitenoise558sp2020shell.zip on acad
Another bash script that
projects the class attribute from other attributes for
classification by scikit-learn.
Assignment 3 on analyzing
time-series data is due by 11:59 PM April 15 via make
turnitin.
Time
series PPT slides.
My
solution to Assignment 3 and a concept
related to bonus points awarded for Q12.
Chapter
4 Weka slides starting at slide 76 on logistic
regression, emphasis on hyperplanes and perceptrons (neural
nets).
"Algorithms: the basic methods" -- Instance-based learning
and nearest-neighbor KD and ball trees are in
these slides.
My
slides abstracted from Weka Section 7 on "Extending
Linear Models". Support vector machines & neural nets.
Slides for
Chapter 6, Rules and Trees.
Slides for
Chapter 7, Extending instance-based and linear models.
Assignments 4 & 5
are individual student mini-research projects.
Assignment 4 is due April 5, and Assignment
5 is due April 26, with talks to follow, as described in
the linked handout.
Liquid Interactive will make an award to
the best overall project as described in the linked handout.
ZOOM RECORDING ARCHIVES
Jan
21, 2020 Last hour of class was start of an overview of
using Weka.
Jan
28, 2020 Went through the fall CSC458 final comprehensive
assignment, demoing how to use Weka, and also slides on error
measures.
Feb
4, 2020 Went over Assignment 1 and the audio signal data
domain on which it is based.
Feb
11, 2020 Went over slides
for Ensemble Learning & also a case study of recruitment &
retention of scholarship students.
Feb
18, 2020 Went over Assignment 2 handout, Assignment 4 &
5 topics & Liquid award, answered some Assignment 1 questions.
Feb
25, 2020 Went over history of student programmer study, data
sonification, & machine listener, in context of instance-based
models.
Mar
3, 2020 Went over time series slides & examples, then
work session Q&A.
Mar
24, 2020 Went over my solution to assn2, handout for assn3,
and some Q&A about assn4/5.
Mar
31, 2020 Went over bash shell scripting for batch,
command-line invocation of Weka, some assignment Q&A.
April
1 office hours did some trial & error work on removing
non-ASCII chars from .csv files.
Apr
8, 2020 Office hour Q&A on ifelse(...) in MathExpression
for Assignment 3. Did not record work session of 4/7.
Apr
14, 2020 Assorted Q&A about assignments 3 & 5 during
a work session.
Apr
21, 2020 Went over my solution to assn3 time-series and
answered numerous assn5 questions.
Apr
28, 2020 The first half of final project presentations.
I forgot to start recording until 5 minutes
into Faith's presentation. Apologies to Lori and Faith.
Apr
30, 2020 11 AM Faculty presentation on "Assessing a
Scholarship Program for Underrepresented Students in CS&IT"
May
5, 2020 The second half of the final project presentations.
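The April 1 office hours entry above mentions removing non-ASCII characters from .csv files. One possible approach is sketched below in Python. It is not necessarily the method tried in that session. The function names, the file paths passed in, and the assumed UTF-8 source encoding are all hypothetical.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character dropped."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def clean_csv(src_path, dst_path, src_encoding="utf-8"):
    """Copy a CSV file line by line, dropping non-ASCII characters.

    Bytes that cannot be decoded with src_encoding become U+FFFD on
    read, which is itself non-ASCII and is therefore dropped as well.
    newline="" preserves the file's original line endings.
    """
    with open(src_path, "r", encoding=src_encoding,
              errors="replace", newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(strip_non_ascii(line))


# Hypothetical row, for illustration only.
print(strip_non_ascii("caf\u00e9,1"))
```

Dropping characters changes the data, so it is worth comparing row and column counts before and after cleaning. Commas and quotes are ASCII and pass through unchanged, so the CSV structure itself is preserved.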