CSC 558 - Data
Mining and Predictive Analytics II, Fall 2021, Monday 6-8:50
PM in Old Main 158.
First day handout (syllabus that
is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools
and Techniques, Fourth Edition, Witten et al., ISBN
978-0128042915. You can buy a discounted copy of the Third
Edition at the KU Book Store; either edition is fine.
There are on-line copies of the Third Edition available in
Rohrbach Library.
There are 4 courses or 3 courses + (research or internship) in
our Graduate
Data Analytics certificate program.
You need to register for the program (registration is free),
and you can use courses from a CSIT master's program.
Talk to me if you are interested.
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I make
mistakes in pronunciation, name, gender, or anything else.
Thanks!
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student
was a minor at the time the incident occurred.
Regardless of the student’s age, if the incident is disclosed
to me outside the classroom setting or a course assignment, I
am required by law to report the disclosure, including
relevant details, such as the names of those involved in the
incident, to Public Safety and Police Services and to Mr.
Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. During a working session you may leave
after completing and turning in all due work; you are
encouraged to stay to get additional practice and ask
questions. Thank you.
RESOURCES & HANDOUTS. We will use research results
published by students & me to discuss various topics.
Link to the Spring 2021 CSC458 prerequisite
course.
Slides leading to Assignment on Ensemble Learning.
Fall 2017 CSC458 slides on evaluating numeric
prediction.
A summary of the Kappa
Statistic.
A subset of Chapter
5 on Evaluation and 7 on Data Transformations.
Chapter 12 on Ensemble Learning.
Here are textbook
slides from Kotu’s & Deshpande’s Predictive
Analytics and Data Mining: Concepts and Practice using
RapidMiner.
There is an excellent on-line video course Predictive
Analytics Training with Weka (Introduction) by one of
our textbook authors & Weka creators.
Here is our
textbook's website. We will be using version 3.8.5 of the
Weka tool set, which you can download
to your machine from here.
Here is where you can run it on campus PCs
(S:\ComputerScience\WEKA):
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
Additional
Weka documentation is here.
I will draw some material from
this textbook as well.
We may use the
Kaggle site for a project some time this semester.
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial
"Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information
Science Educators (PACISE)
Edinboro University of PA, Edinboro, PA, April 10-11, 2015.
Here are the slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators
(PACISE) Kutztown University of PA, Kutztown, PA, April 1-2,
2016. Here are the slides
from the talk.
Background
for 11/15 presentation on prepping a framework for
CSC523 projects using Weka-like workflow.
Code on acad / mcgonagall under
~parson/DataMine/TimeRaptors4CSC523
Bash Shell
Scripting for Data Science lecture slides for first hour
11/29/2021
ASSIGNMENTS
To start Weka on campus PCs with S: mounted go to
S:\ComputerScience\WEKA and click one of the batch files.
ASSIGNMENT 0: Stay Safe & Protect Your
Neighbors in Fall 2021
"Simulated
Contact Tracing of COVID-19 Propagation at Kutztown University
for Fall 2020" best PACISE 2021 faculty paper
Slides for the
updated talk. Fall 2021 update. Python R0 calculator.
Simulation code at
S:\ComputerScience\Parson\Processing3\CovidUFall2021V1MIDI
KU Mask
Adherence Process
ASSIGNMENT 1: Review of Classification,
Introduction to Instance Based (Lazy) Learning
PREP:
Compilation of Weka slides
on Instance Based Learning and Clustering
Wissam Malke's thesis "Machine
Listening with Very Small Training Datasets"
Slides for his
thesis
Follow-up white paper "Mapping
Data Visualization to Timbral Sonification and Machine
Listening"
Instance-Based Learning
Algorithms, a paper from 1991.
K*: An Instance-based Learner Using
an Entropic Distance Measure, a paper from 1995.
Locally Weighted Naive
Bayes, a paper from 2012.
A
graph on informational entropy, which relates to building rules,
decision trees, and K*.
Kappa
statistic summary
Audio
signal overview for Assignment 1 (reused with data
modifications for fall 2021).
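The informational entropy mentioned in the PREP list above can be illustrated with a short Python sketch. This is a minimal example and is not taken from the linked graph or from Weka: the function names `entropy` and `information_gain` and the toy weather rows are hypothetical. It computes Shannon entropy in bits for a list of class labels, and the reduction in entropy (information gain) from splitting on one nominal attribute, which is the quantity a decision-tree learner compares when choosing a split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum -p * log2(p) over each class that actually occurs.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attr_index, label_index=-1):
    """Entropy reduction from splitting rows on one nominal attribute.

    rows is a list of tuples; label_index selects the class column.
    """
    labels = [row[label_index] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[label_index])
    # Weighted average entropy of the partitions after the split.
    remainder = sum(
        (len(part) / len(rows)) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: (outlook, windy, play)
toy_rows = [
    ("sunny", "yes", "no"),
    ("sunny", "no", "no"),
    ("rainy", "yes", "yes"),
    ("rainy", "no", "yes"),
]
gain_outlook = information_gain(toy_rows, 0)  # outlook separates the classes
gain_windy = information_gain(toy_rows, 1)    # windy tells us nothing here
```

In this toy data the outlook attribute splits the classes cleanly, so its gain equals the full class entropy, while windy leaves each partition as mixed as before.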
ASSIGNMENT
1 on instance-based learning is due via D2L
Assessments -> Assignments by end of 9/30.
Late penalty is 10% per day starting 9 AM
after the deadline.
If your training and testing files have
attributes ordered differently or tid is missing from training
data you may see this pop-up:
Just click Yes and you'll see the attribute mapping like
this when you run the classifier.
Here is some related
Python code in a zip file for discussion in the September
20 class.
Here are the
AddExpression derived attribute distributions from
email to class on 9/26.
Here is my solution posted from class on
October 4 (temporarily removed).
Here is my Python script for deriving
an alternate representation of these data.
ARFF files are under
S:\ComputerScience\parson/Weka\audioExample on the KU PC
network.
ASSIGNMENT
2 will be on applying ensemble learning &
previous techniques.
It is due by end of
Thursday October 21 via D2L Assignment 2.
10% penalty per day
late without a medical excuse.
Slides leading to Assignment on Ensemble
Learning.
CSC458 slides
on evaluating numeric prediction.
A summary of the Kappa
Statistic.
A subset of Chapter
5 on Evaluation and 7 on Data Transformations.
Chapter 12 on Ensemble Learning.
Here is my solution.
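The Kappa statistic listed among the resources above can be computed directly from a confusion matrix. The following is a minimal Python sketch of Cohen's kappa, not code from the linked summary; the function name `kappa` and the example matrix are hypothetical. Rows are taken to be actual classes and columns predicted classes.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix (list of lists).

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) when chance agreement equals 1.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Fraction of instances on the diagonal, i.e. classified correctly.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical two-class example: 35 of 50 instances correct.
example = [
    [20, 5],
    [10, 15],
]
example_kappa = kappa(example)
```

Here the observed agreement is 0.7 and the chance agreement is 0.5, so kappa is about 0.4: the classifier recovers roughly 40% of the improvement available over a chance-level predictor with the same marginals.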
ASSIGNMENT
3 on analyzing time-series data is due by 11:59
PM November 11 via D2L Assignment 3.
Time series PPT slides
for October 18.
Kobayashi
Maru relates to Q12's bonus points and winning a no-win
situation by redefining the rules of engagement.
After two students
stumbled onto this approach, I incorporated it into Q14 after
exploring clustering of Q15.
ASSIGNMENTS
4 & 5 are individual student mini-research
projects.
ASSIGNMENT 4 is due via D2L on Friday
November 26.
ASSIGNMENT 5 materials are due via D2L on
Sunday December 5, with presentations on the 6th &
13th.
Liquid Interactive will make an award to the
best overall project as described in the linked handout.
ZOOM VIDEO ARCHIVE
August
30 First day handout, semester overview, and walk-through
of ASSIGNMENT 0 data-driven COVID@KU simulations.
September
13 Went over instance-based (lazy) machine learning &
started Assignment 1.
Zoom glitches due to Live Transcript. I will
disable that in the future.
September
20 Went over various perspectives on time-periodic
datasets and their frequency domain normalization &
analysis.
Here is some related
Python code in a zip file for discussion in the
September 20 class.
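As a rough illustration of frequency-domain analysis of a periodic signal, here is a minimal Python sketch. It is not the code in the zip file: the function names are hypothetical, it uses a naive discrete Fourier transform rather than an FFT library, and scaling every magnitude by the strongest one is only one assumed form of normalization.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..n/2 for a real-valued signal.

    Naive O(n^2) transform, adequate for short illustrative signals.
    Each magnitude is divided by n so it does not grow with signal length.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total) / n)
    return magnitudes


def normalize_to_peak(magnitudes):
    """Scale magnitudes so the strongest bin is 1.0 (assumed normalization)."""
    peak = max(magnitudes)
    return [m / peak for m in magnitudes] if peak > 0 else list(magnitudes)


# Hypothetical test signal: a sine wave with 3 cycles over 32 samples.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
spectrum = normalize_to_peak(dft_magnitudes(signal))
strongest_bin = spectrum.index(max(spectrum))
```

Normalizing to the peak makes two recordings of the same waveform at different amplitudes yield the same attribute values, which is one reason to normalize before handing frequency-domain attributes to a classifier.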
September
27 Assignment 1 pointers, slides on MDL and
evaluating numeric prediction, Assignment 1 Q&A.
October 4 (temporarily removed)
went over solution to assn1 including alternative FFT encoding,
then ensemble learning and assn2.
October
18 went over slides on Time Series and then
work / Q&A for Assignment 2.
October
25 went over my solutions with added data encodings for
Assignment 2 and then got up to Q13 on new Assignment 3.
November
1 some Assignment 3 Q&A, clustering on Q13-Q15, overview
of individual Assignments 4 & 5.
November
8 went over Python scripts for batch automating Weka runs
then Q&A on Assignment 3.
Scripts on acad
/home/kutztown.edu/parson/DataMine/csc458ensemble5sp2021 in files
csc458ensemble5sp2021.py
parallel/csc458ParallelEnsemble5sp2021.py
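For readers without access to acad, batch automation of Weka runs can be sketched along these lines in Python. This is not the content of the scripts named above: the function names, the `weka.jar` location, and the ARFF file names are hypothetical. It relies only on Weka's standard command-line convention of naming a classifier class and passing `-t` for the training file and `-T` for the test file.

```python
import subprocess


def weka_command(classifier, train_arff, test_arff, weka_jar="weka.jar", options=()):
    """Build the java command line for one Weka classifier run."""
    return [
        "java", "-cp", weka_jar, classifier,
        "-t", train_arff,   # training file
        "-T", test_arff,    # separate test file
        *options,           # any classifier-specific options
    ]


def build_batch(classifiers, train_arff, test_arff, weka_jar="weka.jar"):
    """Return one command per classifier class name."""
    return [weka_command(c, train_arff, test_arff, weka_jar) for c in classifiers]


def run_batch(commands):
    """Run each command in turn and collect Weka's text output.

    Requires java and the Weka jar to be available; nothing here checks that.
    """
    outputs = []
    for command in commands:
        completed = subprocess.run(command, capture_output=True, text=True)
        outputs.append(completed.stdout)
    return outputs


# Hypothetical file names; the classifier class names are standard Weka classes.
commands = build_batch(
    ["weka.classifiers.rules.OneR", "weka.classifiers.trees.J48"],
    "train.arff",
    "test.arff",
)
```

Calling `run_batch(commands)` would then execute the runs in sequence. Keeping command construction separate from execution makes it easy to print and inspect the commands before starting a long batch.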
November
23 (no video for November 15 Assignment 3 solution because
one student's still out) ...
Using
OneR iteratively to find most important attributes;
LinearRegression / M5P / Normalize guidelines with reference to
Assignment 2.
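The idea of using OneR iteratively to rank attributes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Weka's OneR: it handles nominal attributes only, scores rules by training-set errors, and the function names and toy rows are hypothetical. Each pass picks the attribute whose one-attribute rule makes the fewest errors, records it, removes it, and repeats.

```python
from collections import Counter, defaultdict


def one_r_errors(rows, attr, label):
    """Training errors of the one-attribute rule built on attr.

    The rule predicts, for each attribute value, the majority class
    among the rows that have that value.
    """
    classes_by_value = defaultdict(Counter)
    for row in rows:
        classes_by_value[row[attr]][row[label]] += 1
    # Everything that is not the majority class for its value is an error.
    return sum(
        sum(counts.values()) - max(counts.values())
        for counts in classes_by_value.values()
    )


def rank_attributes(rows, attrs, label):
    """Repeatedly pick the best OneR attribute, then remove it and rerun."""
    remaining = list(attrs)
    ranking = []
    while remaining:
        best = min(remaining, key=lambda a: one_r_errors(rows, a, label))
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Hypothetical toy data with three nominal attributes and a class.
toy = [
    {"a": "x", "b": "p", "c": "m", "cls": "yes"},
    {"a": "x", "b": "p", "c": "n", "cls": "yes"},
    {"a": "y", "b": "p", "c": "m", "cls": "no"},
    {"a": "y", "b": "q", "c": "n", "cls": "no"},
]
ranking = rank_attributes(toy, ["a", "b", "c"], "cls")
```

The loop mirrors the manual run, remove, rerun workflow. Because each attribute's rule is scored independently of the others, the result in this sketch is the same as sorting the attributes by their OneR error.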
November
29 Went over Bash Shell
Scripting for Data Science slides then Q&A on assn5.
December
6 First group of final
student presentations.
December
13 Second group of final
student presentations.