CSC 558 - Data
Mining and Predictive Analytics II, Spring 2023, Wed 6-8:50 PM
in Old Main 158.
IT
outages planned for spring break
First day handout (syllabus that
is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools
and Techniques, Fourth Edition, Witten, et al., ISBN
978-0128042915. You can buy a discounted copy of the 3rd
Edition at the KU Book Store -- either edition is fine.
There are on-line copies of the Third Edition available in
Rohrbach Library.
There are 4 courses or 3 courses + (research or internship) in
our Graduate
Data Analytics certificate program.
Registration for the program is free but required,
and you can use courses from a CSIT master's program.
Talk to me if you are interested.
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I make
mistakes in pronunciation, name, gender, or anything else.
Thanks!
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student
was a minor at the time the incident occurred. If a student
discloses such incidents to me outside the classroom setting or
a course assignment, I am
required by law to report the disclosure, including relevant
details, such as the names of those involved in the incident,
to Public Safety and Police Services and to the Deputy to the
President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. During a working session you may leave
after completing and turning in all due work; you are
encouraged to stay to get additional practice and ask
questions. Thank you.
RESOURCES & HANDOUTS. We will use research results
published by students & me to discuss various topics.
Link to the Fall 2022 CSC458 course.
Jan. 25 Weka overview using data from acad's
~parson/DataMine/csc570spring2023/month_aggregate_HMS_goodyears.arff.gz
Data is a superset of that used in Fall
2022 CSC458 Assignment 3.
See Section 3 of ongoing Analysis
of Hawk Mountain Sanctuary Observation Data from 1976
through 2021.
Slides leading to Assignment on Ensemble Learning.
Weka slides on evaluating numeric
prediction.
A summary of the Kappa
Statistic.
A subset of Chapter
5 on Evaluation and 7 on Data Transformations.
Chapter 12 on Ensemble Learning.
Here is our
textbook's website. We will be using version 3.8.5 of the Weka tool set, which you can
download to your machine from here.
Here is where you can run it on campus PCs
(S:\ComputerScience\WEKA):
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
Additional
Weka documentation is here.
I will draw some material from
this textbook as well.
We may use the
Kaggle site for a project some time this semester.
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial
"Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information
Science Educators (PACISE)
Edinboro University of PA, Edinboro, PA, April 10-11, 2015.
Here are the slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators
(PACISE) Kutztown University of PA, Kutztown, PA, April 1-2,
2016. Here are the slides
from the talk.
Background
for 11/15 presentation on prepping a framework for
CSC523 projects using Weka-like workflow.
Code on acad / mcgonagall under
~parson/DataMine/TimeRaptors4CSC523
Bash
Shell Scripting for Data Science lecture slides for
first hour 11/29/2021
ASSIGNMENTS
To start Weka on campus PCs with S: mounted go to
S:\ComputerScience\WEKA and click one of the batch files.
ASSIGNMENT 1: Review of Classification,
Introduction to Instance Based (Lazy) Learning
PREP:
Compilation of Weka slides
on Instance Based Learning and Clustering
Wissam Malke's thesis "Machine
Listening with Very Small Training Datasets"
Slides for his
thesis
Follow-up white paper "Mapping
Data Visualization to Timbral Sonification and Machine
Listening"
Instance-Based Learning
Algorithms, a paper from 1991.
K*: An Instance-based Learner Using
an Entropic Distance Measure, a paper from 1995.
Locally Weighted Naive
Bayes, a paper from 2012.
A
graph on informational entropy, relates to building rules,
decision trees, and K*.
Kappa
statistic summary
Audio
signal overview for Assignment 1 (reused with data
modifications for fall 2021).
ASSIGNMENT 1 on
instance-based learning is due via D2L Assessments ->
Assignments by end of Feb. 25.
Late penalty is 10% per day starting 9 AM
after the deadline. This deadline was bumped from Feb. 18 on
2/1/2023.
If your training and testing files have
attributes ordered differently or tid is missing from training
data you may see this pop-up:
Just click Yes and you'll
see the attribute mapping like this when you run the
classifier.
Here are the AddExpression
derived attribute distributions for checking your
AddExpression results.
Here is my Python
script for deriving an alternate representation of these data.
Post-mortem on Assignment 1 out-of-range funfreq for some
PulseOsc:
Comparison of lowest and highest tfreq tags
by tosc, max funfreq for PulseOsc.
Max funfreq time
domain and frequency
domain graphs for tid 234361.
Frequency
domain bins with my debugging hack for 234361 extracted
by my ChucK
program (Chuck Reference).
ChucK
program that generated 234361 WAV file.
Python program for converting above
ChucK output to CSV.
Non-Chuck Python program
uses numpy and scipy to extract frequency bins from WAV
files.
Bin analysis debug printing for tid
234361.
ARFF
file that does not chop off frequencies below 100 Hz (no
high-pass filter).
ARFF
file that does chop off frequencies below 100 Hz (a
high-pass filter)
Extending above ChucK program from 512 to
32,768 FFT bins fixes the aliasing problem > 2000 Hz.
See on acad
~parson/DataMine/ We will go over this fix March 8.
PulseUna_234361_512.txt
PulseUna_234361_22050.txt
PulseUna_234361_22050_examine.txt
See new application of FFTs to deep learning
here
and here.
Pivoted
ARFF file, March 8, shows step, frequency, and amplitude
steps for 32,768-bin Chuck FFT extraction.
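The numpy/scipy approach to extracting frequency bins from WAV files, mentioned in the post-mortem list above, can be sketched roughly as follows. This is a minimal illustration, not the course's actual script: the function names, the Hann window, and the `min_hz` cutoff (standing in for the 100 Hz high-pass idea) are all assumptions, and the bin-width arithmetic assumes that the FFT size counts time-domain samples.

```python
import numpy as np
from scipy.io import wavfile


def bin_width_hz(sample_rate, n_fft):
    """Spacing in Hz between adjacent bins of an n_fft-point FFT."""
    return sample_rate / n_fft


def frequency_bins(samples, sample_rate, n_fft):
    """Return (frequencies, amplitudes) for the first n_fft samples."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim > 1:
        # Average stereo (or multi-channel) audio down to mono.
        samples = samples.mean(axis=1)
    frame = samples[:n_fft]
    # Hann window to reduce spectral leakage (an assumption, not from the course code).
    frame = frame * np.hanning(len(frame))
    # rfft zero-pads the frame if it is shorter than n_fft.
    amplitudes = np.abs(np.fft.rfft(frame, n=n_fft))
    frequencies = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return frequencies, amplitudes


def strongest_frequency(frequencies, amplitudes, min_hz=0.0):
    """Frequency of the largest-amplitude bin at or above min_hz."""
    keep = frequencies >= min_hz
    return float(frequencies[keep][np.argmax(amplitudes[keep])])


def wav_frequency_bins(path, n_fft):
    """Read a WAV file and extract its frequency bins."""
    sample_rate, samples = wavfile.read(path)
    return frequency_bins(samples, sample_rate, n_fft)
```

Assuming a 44,100 Hz sample rate, `bin_width_hz(44100, 512)` is about 86 Hz while `bin_width_hz(44100, 32768)` is about 1.3 Hz, which is one way to see why a larger FFT separates nearby frequencies more finely.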
ASSIGNMENT
2 on regression & ensemble models is due by
11:59 PM on Monday March 13, 2023 via
D2L Assignment 2.
Here is the README.txt that includes the
questions.
Potential pitfall: Q7: Re-Normalize if
necessary to get Normalized non-target attributes.
Continue using
Normalized nontarget attributes unless otherwise instructed
(at Q17).
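For checking intuition about what "Normalized non-target attributes" means, here is a minimal sketch, assuming min-max scaling of each numeric attribute into [0, 1] with the target column left untouched. The function name, the list-of-rows layout, and the choice to map a constant column to 0.0 are assumptions for illustration; use Weka's own Normalize filter for the assignment itself.

```python
def normalize_non_target(rows, target_index):
    """Min-max scale every column except target_index into [0, 1].

    rows is a list of equal-length lists of numbers; a new list is returned.
    """
    result = [list(row) for row in rows]
    for col in range(len(rows[0])):
        if col == target_index:
            continue  # leave the target attribute in its original units
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in result:
            # A constant column has no range; mapping it to 0.0 is an assumption.
            row[col] = (row[col] - low) / span if span else 0.0
    return result
```

Re-normalizing after deriving or filtering attributes matters because the minimum and maximum of a column can change when instances or attributes change.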
The standard 10% per day deduction for late
assignments applies.
The 13th is the start of spring break, so no
office hours that day.
The preceding weeks’ office hours are at
usual times & modalities.
ASSIGNMENT
3 on time series regression is due by the end of
Sunday April 9.
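The April 5 class went over Python code that runs the Weka command line for all Assignment 3 variants. The sketch below is not that code, only a minimal illustration of the idea, assuming `java` is on the PATH and that `weka.jar` is in the current directory. The ARFF file names and the `weka_jar` default are hypothetical; `-t` and `-T` are Weka's general options for the training and test files.

```python
import subprocess


def weka_command(classifier, train_arff, test_arff=None, weka_jar="weka.jar"):
    """Build the argument list for one Weka command-line run."""
    command = ["java", "-cp", weka_jar, classifier, "-t", train_arff]
    if test_arff is not None:
        # Without -T, Weka evaluates with 10-fold cross-validation by default.
        command += ["-T", test_arff]
    return command


def run_variants(classifiers, train_arff, test_arff=None, weka_jar="weka.jar"):
    """Run each classifier in turn and collect Weka's text output."""
    results = {}
    for classifier in classifiers:
        completed = subprocess.run(
            weka_command(classifier, train_arff, test_arff, weka_jar),
            capture_output=True,
            text=True,
            check=False,  # keep going even if one variant fails
        )
        results[classifier] = completed.stdout
    return results


if __name__ == "__main__":
    # Hypothetical file names; substitute the assignment's ARFF files.
    variants = [
        "weka.classifiers.functions.LinearRegression",
        "weka.classifiers.trees.M5P",
    ]
    for name, output in run_variants(variants, "train.arff", "test.arff").items():
        print(name)
        print(output)
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.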
ASSIGNMENTS 4 & 5 will be student mini-research
projects.
Assignment 5 grading criteria are here:
https://faculty.kutztown.edu/parson/fall2021/CSC558FallFinal2021.html#Assignment_5
Please make sure to turn in slides, data or data dictionary /
schema,
and Sections 4 & 5 of your PDF via D2L by
11:59 PM Wednesday, the night of the talks.
Project 5 grading rubrics:
The 5.3 presentation counts for 20% of this
project. Make it clear. Be ready to answer questions.
If in the classroom,
repeat the questions for Zoom students before answering them.
The remaining 80% distribute as 10 points
each for 5.1 and 5.2.a through 5.2.g.
 1  18:00  Reagan Newswanger  Super Bowl advertisement data
 2  18:15  Olivia Weber       Proprietary dataset Direct Operating Profit correlations
 3  18:30  Pei Hua Lin        Mobile strategy games from Kaggle
 4  18:45  Yelitza Pagan      PASSHE employee pay correlations
 5  19:00  Jack McNally       Student survey academic performance analysis (Kaggle)
 6  19:15  Ryan Quinn         Probability of voting per other attributes
    19:30  BREAK
 7  19:45  Nicholas Morello   Song attribute analysis from Kaggle
 8  20:00  Donna DeMarco      NSF-funded scholarship analysis
 9  20:15  Abby Komlenic      Academic fund raising amounts correlated with other donor properties
10  20:30  Edwin Cadiz        Patient experience/reviews using Abilify to treat mental health conditions
11  20:45  Nik Golombek       Fuel Efficiency
12  21:00  Nathan Rew         USGS stream data differentials in Limestone hi vs. lo
Here is the link
to Fall 2021 Assignments 4 & 5. We are using these specs.
Here is the topic sheet for spring
2023.
You can turn in a data dictionary like
this if working with proprietary data. It needn't be
that long. Obfuscating such data is OK.
We will use the same requirements as fall 2021.
Assignment 4 is due Wednesday April 26 and Assignment
5 is due Wednesday May 10 via D2L.
Student presentations
will be during the Final Exam period at class time on Wednesday
May 10.
Additional sites beyond
what is available on the fall 2021 page:
https://waterdata.usgs.gov/nwis/rt
Use across-time for a site or across-site for a time period.
We have
analyzed dissolved oxygen levels as a function of temperature,
day of year, time of day.
You
could analyze changes in properties as a function of year or
year,month for a single site.
You
could analyze differences in attributes as a function of
different sites in the same time period.
https://data.noaa.gov/datasetsearch/
and https://www.ncei.noaa.gov/cdo-web/
for climate data.
You
could look for global stilling (reduction in wind speed
differences) for two sites N-to-S.
Kaggle is a popular data
source. Make your own analysis or variation.
Johns
Hopkins Coronavirus data & maybe others until March
2023.
Liquid Interactive is
sponsoring 3 prizes ($1000, $500, and $250) for the top-three
projects in 2023.
Connor Ellis' presentation
and short
paper from fall 2021.
Kelly Fox's presentation and
short paper
from spring 2020.
Tyler Stoney's presentation
and short paper
from spring 2018.
ZOOM VIDEO ARCHIVE
Jan
25 class: First day handout, overview / review of some
CSC458 projects & related, introduction / review of Weka.
Feb
1 class: Went over instance-based learning slides
& a Weka lazy/instance based demo using PA water data.
Assignment 1 due date slipped to 2/25. We
will go over it & have a work session on 2/8.
Feb
8 class: Went over Assignment 1 handout + short
discussion of Bayes and conditional probability.
Feb
15 class Went over slides on clustering & demoed Weka
K-means clusters using Assignment 1 data.
Assignment 1 demo of AddExpression + Q&A
in last part of class.
Feb
22 class went over slides on Evaluating
Regression and Ensemble Models, then Assignment
2 up to Discretize
Mar 1 class Post-mortem on Assignment 1 above, including
alternate data preprocessing, finished Assignment 2 from
Discretize until Q13.
Mar
8 class Association Rules &
Clustering. Preceded by some Assignment 2 Q&A for
first 20ish minutes.
I missed hitting Zoom RECORD after 8 PM so I
will make those topics clear in Assignment 3.
Mar 22 class End of Assignment 2 and went over Assignment 3
handout & related Hawk Mountain background.
Mar
29 class Graduate Certificate discussion, table-driven
Python scripting for CSC523, work session.
Assignment 3 handout updated with Q&A
from last night's work session.
Apr
5 class Went over Python code to run Weka command line for
all Assignment 3 variants.
Apr
12 class Went over solutions to Assignment 3 &
requirements for Assignments 4 & 5.
April
19 class Script for fixing CSV files
with some rows of data that are too big or small, bash
shell commands for manipulating text data.
April
26 class some Q&A on Assignments 4&5 and my
extension Python script for Assignment 3 RAPTOR_td,
RAPTOR_td_log10 target attributes.
Assignment 5 due date is
bumped to May 10 11:59 PM, the night of the
15-minute presentations.
May
3 class Some reminders about Assignment 5, then
short discussion of my Assignment 3 work in progress, then
work time Q&A.
May
10 class Student presentations of Assignments 4 & 5.
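The April 19 class covered a script for fixing CSV files in which some rows are too big or too small. The sketch below is not that script. It is a minimal illustration that assumes "too big or small" means rows with more or fewer fields than the header. The function names are hypothetical, and padding with `?` is an assumption based on Weka's convention for missing values.

```python
import csv
import io


def fix_rows(reader, fill="?"):
    """Yield rows padded or truncated to the width of the header row."""
    header = next(reader)
    width = len(header)
    yield header
    for row in reader:
        if len(row) < width:
            # Too few fields: pad with the missing-value marker.
            row = row + [fill] * (width - len(row))
        elif len(row) > width:
            # Too many fields: drop the extras on the right.
            row = row[:width]
        yield row


def fix_csv_text(text, fill="?"):
    """Return CSV text in which every row has as many fields as the header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(fix_rows(csv.reader(io.StringIO(text)), fill))
    return out.getvalue()
```

Truncating long rows silently discards data, so a real cleanup script might log the affected line numbers instead of dropping fields without notice.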