CSC 458 - Data Mining & Predictive Analytics I, Spring 2024,
Wed 6:00-8:50 PM, Old Main 158.
Dr. Dale E. Parson, https://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458
OR See D2L Course CSC458 -> Content -> Overview for
the link.
Student instructions for using Zoom.
If you don’t want to be recorded or are a minor, use
PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this
permission to record slip. I will use it to
take attendance in week 1.
Office Hours Monday 2-4, Wednesday 4-6 (Zoom only),
Thursday 10-11 or by appt. All available via Zoom.
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Thursday
May 9 office hour switched to 1-2 PM, others
as above.
First day handout (syllabus that is
specific to this semester).
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I make
mistakes in pronunciation, name, gender, or anything else.
Thanks!
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student was
a minor at the time the incident occurred. Regardless of
the student’s age, if the incident is disclosed to me outside
the classroom setting or a course assignment, I am required by
law to report the disclosure, including relevant details, such
as the names of those involved in the incident, to Public Safety
and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. There will be a 10% deduction from a
homework assignment for repeated web surfing, web-based chatting
or other use of the Internet for activities unrelated to class
activities during both lectures and working sessions. During a
working session you may leave after completing and turning in
all due work; you are encouraged to stay to get additional
practice and ask questions. Thank you.
RESOURCES & HANDOUTS.
For students new to using our department's
Linux servers:
Please log into acad or mcgonagall and run the following commands:
$ python -V
Python 3.7.7
$ ipython -V
7.14.0
If you see earlier version numbers, edit a file called .bash_profile in
your login directory and add the following 2 lines at the top:
alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"
Log out,
log back in, and check the version numbers again. Let me know if
you run into problems.
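The same check can be made from inside Python. This is a minimal sketch using only the standard library; the helper name version_ok is made up for illustration, and the 3.7 minimum is taken from the version shown above.

```python
import sys


def version_ok(version_info=sys.version_info, minimum=(3, 7)):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Report the interpreter that is actually running this script.
status = "OK" if version_ok() else "older than 3.7, check your aliases"
print("Python", sys.version.split()[0], status)
```

Save it as a file and run it with python to see which interpreter your alias really starts.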
Windows users can download the WinSCP file transfer
client in the Computer Science sub-menu below here.
It is also possible to use the scp,
ssh-based file copy command in Mac or Windows command line
utilities.
Textbook: Data Mining: Practical Machine Learning
Tools and Techniques, Fourth Edition, Witten et al.,
ISBN 978-0128042915. You can buy a discounted
copy of the 3rd Edition at the KU Book Store -- either
edition is fine. I have put a copy of the 3rd
edition of the textbook on reserve in Rohrbach Library.
You can go to the front desk & borrow it overnight.
If you are new to Python you are encouraged to come to my
office hours in person or via Zoom.
I adapted slides for Linear Regression and M5P Trees from
Kotu & Deshpande's Predictive Analytics and Data Mining,
10/30/2017.
Here is our textbook's web page. We will be
using the Weka tool set, which you can download to your machine from here.
(Download & install Weka 3.8.6)
The PDF Appendix to our textbook is here.
It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
I will draw some material from this
textbook as well.
The Weka download page has (had?) this note:
If your computer has a display that has a high pixel density,
and you are using Windows, Weka's user interfaces may not be
scaled appropriately and appear tiny. Installing Java 9 or later
solves this problem. Alternatively, in the Program menu of
Weka's GUIChooser, go into Settings, and select
WindowsLookAndFeel from the "Look and feel for UI" dropdown
menu. Some Weka packages currently do not work (properly) with
Java 9 or later (tigerJython and scatterPlot3D).
PYTHON.
How
to Think Like a Computer Scientist looks like a good
tutorial for Python 3.x newbies.
We will be using the 3.x version of Python.
Try running python -V to see that you are
getting Python 3.x.x as your default. We may be updating
the version early in the semester.
From the mcgonagall machine (ssh mcgonagall from acad) do the
following:
Edit a file called .bash_profile
in your login directory (create it if needed) and add these
3 lines near the top.
export PATH="/usr/local/bin:${PATH}"
alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"
Save the file and exit the editor, log out and log back into
mcgonagall.
Now type this:
python -V    # You should see this:
Python 3.7.7
If you
install python on your own machine, just running python will get
you the simpler-to-use interpreter.
I will
use ipython in lecture.
The Python website is at http://www.python.org/.
There is a good on-line tutorial and
reference by Steven F. Lott called Building Skills in Python.
There is a PDF copy here.
I taught CSC223
using Python in fall 2023. There are many tutorial
resources.
The IPython
site is here.
We will be using Python for data preparation in assignment 1.
We have Python installed on acad, but if you
want your own copy:
You can download Python 3.x
from here. Use the most recent stable 3.x for this course.
Documentation including tutorials for
the 3.x
library is here.
Here are my introductory slides on Python. We
will explore Python in class.
RESOURCES
The pythex utility for
testing Python regular expressions
D. Parson, 2022, Analysis
of Hawk Mountain Sanctuary Observation Data from 1976 through
2021
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial "Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information Science
Educators (PACISE) Edinboro
University of PA, Edinboro, PA, April 10-11, 2015. Here are the
slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators (PACISE)
Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here
are the slides
from the talk.
Wissam Malke's thesis "Machine
Listening with Very Small Training Datasets".
Mapping Data Visualization to Timbral
Sonification and Machine Listening, Dale E. Parson, Wissam
Malke, Halley Langley, and Danielle Emily Hoch, white paper,
2017.
D. Parson, "Simulated
Contact Tracing of COVID-19 Propagation at Kutztown
University for Fall 2020"
and the slides. Here is one
video and here
is another.
Textbook slides
Chapter 1 (week 1 - overview)
Chapter 2 (week 1 - input)
Chapter 3 (week 3 - output)
Chapter 4 (rules
& trees week 5, linear models & model trees week 9,
Bayesian inference week 11, clustering week 12)
Compilation
of Weka slides on Instance Based Learning and Clustering
A graph on informational entropy,
relates to building rules & decision trees.
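For readers who want to reproduce the shape of such a graph, here is a minimal sketch of two-class entropy in bits. The function name binary_entropy and the sample probabilities are illustrative choices, not taken from the linked graph.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-class set with class proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure set carries no uncertainty
    return -p * log2(p) - (1 - p) * log2(1 - p)


# Entropy peaks at 1 bit when the classes are evenly mixed (p = 0.5).
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}  H = {binary_entropy(p):.3f} bits")
```

Rule and tree learners prefer splits whose subsets sit near the low-entropy ends of this curve.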
A page describing Bayes theorem and
related matters.
A Bayes computer
for a 52-card deck is on acad at
~parson/DataMine/BayesCards.py
BayesNet
examples from the textbook.
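As a rough illustration of Bayes' theorem on a 52-card deck, the sketch below computes P(king | face card) two ways: by Bayes' rule and by direct counting. It is not the BayesCards.py script on acad; every name in it is made up for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event, given=lambda card: True):
    """P(event | given), counted over a uniformly shuffled deck."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in ("J", "Q", "K")


prior = prob(is_king)                       # P(king)
likelihood = prob(is_face, given=is_king)   # P(face | king)
evidence = prob(is_face)                    # P(face)
posterior = likelihood * prior / evidence   # Bayes' rule: P(king | face)

print("Bayes' rule:    ", posterior)
print("Direct counting:", prob(is_king, given=is_face))
```

Exact fractions are used so the two answers can be compared without rounding error.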
Chapter 5 (5.1 - 5.5, week 8 - evaluation)
Chapter 8 (week 6 - data transformations)
Chapter 12 on Ensemble Learning
Time-series Data Analysis
I will draw some material from this
textbook as well.
Chapter 1 (overview)
Chapter 2 (overview)
Chapter 3A (data exploration)
Chapter 3B (data exploration)
Chapter 4A (information-based learning)
Chapter 6A (probability-based learning)
Chapter 6B (probability-based learning)
Chapter 8A (evaluation)
Chapter 8B (evaluation)
Appendix A (descriptive statistics & data visualization)
Appendix B (introduction to probability)
ASSIGNMENTS
ASSIGNMENT
1 on Classification due by 11:59 PM on Thursday February
15 via D2L.
Turn in files CSC458S24ClassifyAssn1Turnin.arff
and README.txt with your answers.
Here is documentation on the Kappa
metric.
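To make the Kappa metric concrete, here is a minimal sketch that computes Cohen's kappa from a confusion matrix. The function name and the example matrix are invented for illustration; the rows-are-actual, columns-are-predicted layout is an assumption of this sketch.

```python
def cohen_kappa(matrix):
    """Kappa for a square confusion matrix (rows = actual, columns = predicted).

    Undefined (division by zero) when chance agreement is already 1.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up two-class example: 70% accuracy, but half of that is chance.
example = [[20, 5],
           [10, 15]]
print(f"kappa = {cohen_kappa(example):.3f}")
```

Kappa rescales accuracy so that 0 means no better than chance agreement and 1 means perfect agreement.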
Here is a slide on information
entropy used in decision tree building.
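A decision tree builder typically picks the attribute whose split reduces entropy the most. The sketch below computes that reduction (information gain) on a tiny made-up table; the attribute names and rows are hypothetical and not from the assignment data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical data: "a" separates the classes perfectly, "b" not at all.
rows = [
    {"a": "x", "b": "p", "cls": 1},
    {"a": "x", "b": "q", "cls": 1},
    {"a": "y", "b": "p", "cls": 0},
    {"a": "y", "b": "q", "cls": 0},
]
for attribute in ("a", "b"):
    print(attribute, round(information_gain(rows, attribute, "cls"), 3))
```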
ASSIGNMENT
2 on Regression due by 11:59 PM on Thursday February 29
via D2L.
Start
at slide 60 Evaluating Numeric Prediction for
correlation coefficient and error measures MAE and RMSE.
See ~parson/DataMine/pearson.py on
acad.
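For reference, the three measures can be sketched in a few lines of plain Python. This is not the pearson.py script on acad; the function names and the sample numbers are made up for illustration.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Predictions that are always off by exactly 1: perfectly correlated, yet wrong.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print("MAE ", mae(actual, predicted))
print("RMSE", rmse(actual, predicted))
print("r   ", round(pearson(actual, predicted), 6))
```

The example is chosen to show why correlation alone is not enough: a constant offset leaves the correlation coefficient at 1 while MAE and RMSE both report the error.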
ASSIGNMENT 3
on data compression & discrete classification is due
11:59 PM March 21 via D2L.
ASSIGNMENT 4 Data
Cleaning Project in Python is due via D2L Assignment 4
by 11:59 PM on Friday April 12.
ASSIGNMENT 5
due 11:59 PM Thursday May 2 via D2L Assignment
5.
ADDED a Python
script to analyze monotonic cluster sequences for Q11
in
this zip file.
We will go over
it on May 8 in class.
If you are
new to working on our Linux systems, you can bring up a
Windows CMD prompt and from there:
ssh
YOURLOGINID@acad.kutztown.edu
It is possible to use the scp,
ssh-based file copy command in Mac or Windows
command line utilities.
If you do not have a favorite
Linux text editor, try running nano on acad. Python
assignments must run on mcgonagall.
PYTHON SELF STUDY FROM LAST
TERM:
For anyone not experienced in programming
Python, here are some tutorials to work through over spring
break
from my CSC223 course last semester:
Weekly class time materials
Week
1: Python Resources and Python
Basics
Read and work along with Sections
1 through 5 of the Python Tutorial
in parallel to our class time examination of
Python basics.
Week
2 is about functions and function-like
constructs in Python.
Week
3 is the sorting example ...
August
29 Class First-day handout,
overview of the course, logging into
acad & mcgonagall Linux servers for
your projects.
August
31 Class Setting up python and
ipython aliases on acad. Interactive
walk through of primitive data types
& some aggregate types
(int, float, str,
None, list, tuple, set,
& frozenset types. The mutable
object types are list and set.)
Next time dicts (a.k.a. maps).
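One quick way to see which of these built-in types are mutable is to test whether a value can be hashed. This is only a proxy that happens to work for the types listed here (a tuple that contains a list is immutable but still unhashable); the helper name is_hashable is made up.

```python
def is_hashable(value):
    """True when hash() accepts the value; mutable built-ins such as list refuse."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


samples = [1, 2.5, "text", None, [1, 2], (1, 2), {1, 2}, frozenset({1, 2})]
for value in samples:
    kind = "immutable" if is_hashable(value) else "mutable"
    print(f"{type(value).__name__:10s} {kind}")

# Mutation is visible through every name bound to the same object.
original = [1, 2]
alias = original
alias.append(3)
print(original)
```

Hashability matters in practice because only hashable values can be set members or dict keys.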
September
5 Class Completed data types with
dictionaries, went over if-for-while
control constructs, functions, and
Python's use of indentation.
ZOOM
ARCHIVES
Jan
24 Class introduced the course, certificates &
minor in data science @ KU, Weka & concepts.
Jan
31 Class went over mechanics and concepts for
Assignment 1.
Feb
7 Class on Assignment 1's
extractAudioFreqARFF17Oct2023.py for extracting assn1's
input
dataset from 10,005 .wav audio files.
Here is a Feb 8 page on the
ZeroR confusion matrices
that confused us at the end of class. The
last 55 minutes were work time with some recorded Q&A.
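As a small illustration of why ZeroR confusion matrices look odd, the sketch below predicts the majority class for every instance and tabulates the result. The labels are made up, and it evaluates on the training labels themselves rather than with cross-validation.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class label; ZeroR ignores all attributes."""
    return Counter(train_labels).most_common(1)[0][0]


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels: 6 of class "a", 3 of "b", 1 of "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"]
majority = zero_r(labels)
predictions = [majority] * len(labels)
for row in confusion_matrix(labels, predictions, ["a", "b", "c"]):
    print(row)
```

Every count lands in the majority class's column, so accuracy equals the majority class's share of the data.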
Feb
14 Class evaluating numeric prediction (slides
60-69), correlation
coefficient, Assignment 2 handout.
Feb
21 Class solution to Assignment 1,
compressing polynomial & exponential data relationships
into linear relationships for use with
linear machine learning algorithms, Assignment 2 Q&A.
Solution to Assignment 1 removed for
summer 2024.
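The idea of compressing a curved relationship into a linear one can be sketched with logarithms. Below, an exponential curve becomes a line after taking log(y), and a power-law curve (one simple kind of polynomial relationship) becomes a line after taking logs of both x and y. The data and coefficients are made up, and the least-squares helper is a plain-Python stand-in rather than anything from Weka.

```python
from math import exp, log


def fit_line(xs, ys):
    """Ordinary least-squares (slope, intercept) for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
exponential = [3.0 * exp(0.5 * x) for x in xs]   # y = 3 * e^(0.5 x)
power_law = [2.0 * x ** 3 for x in xs]           # y = 2 * x^3

# log(y) = log(3) + 0.5 * x, so the slope recovers the growth rate.
exp_slope, exp_intercept = fit_line(xs, [log(y) for y in exponential])
# log(y) = log(2) + 3 * log(x), so the slope recovers the exponent.
pow_slope, pow_intercept = fit_line([log(x) for x in xs],
                                    [log(y) for y in power_law])

print(round(exp_slope, 6), round(exp(exp_intercept), 6))
print(round(pow_slope, 6), round(exp(pow_intercept), 6))
```

After the transform, a linear learner can model data that it would otherwise fit poorly.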
Feb
28 Class some Q&A from office hours on Assignment
2, finish slides on prepping & teaching
data science courses, hand out Assignment
3. There will be some work time next week.
Mar
6 Class Naive Bayes including
playing card
example, work time on Assignment 3.
March
20 Class went over using the pythex utility to parse
data copied from new MyKU (Banner) course listings.
The Python
regular expression library will be used in either
assignment 4 or 5.
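A pattern worked out in pythex can be pasted into Python's re module more or less unchanged. The sketch below parses one course-listing line; the line format is hypothetical and is not the actual MyKU (Banner) layout, so treat the pattern as an illustration only.

```python
import re

# Hypothetical one-line listing format, invented for this example.
LISTING = re.compile(
    r"(?P<dept>[A-Z]{3})\s+(?P<number>\d{3})\s+"
    r"(?P<title>.+?)\s+"
    r"(?P<days>[MTWRF]+)\s+"
    r"(?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})\s*(?P<ampm>[AP]M)"
)

line = "CSC 458  Data Mining & Predictive Analytics I  W 6:00-8:50 PM"
match = LISTING.match(line)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name:7s} {value}")
```

Named groups keep the extraction code readable when the pattern grows, and the non-greedy title group lets the later, stricter groups decide where the title ends.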
March
27 Class went over solution to Assignment 3
and then prep for Assignment 4 to be posted Friday 3/29.
April
3 Class half the slides
on Instance-Based ("Lazy") Learning, examples in Weka
& a visualization. Work time.
April
10 Class Finished slides
on IBk nearest-neighbor instance-based models, also
clustering, went over Assignment 5.
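The core of a nearest-neighbor model fits in a few lines. This is a plain-Python sketch of k-nearest-neighbor classification with Euclidean distance and majority vote, not Weka's IBk implementation; it does no attribute normalization or distance weighting, and the training points are made up.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k=3):
    """Majority label among the k training instances closest to `query`.

    `train` is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical two-attribute training set with two well-separated classes.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"), ((7.0, 5.5), "high"), ((6.5, 7.0), "high"),
]
print(knn_classify(train, (1.2, 1.4)))
print(knn_classify(train, (6.2, 6.1)))
```

There is no training step beyond storing the instances, which is why these models are called "lazy": all the work happens at prediction time.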
April
17 Class Went over slides, paper,
ran simulated contact tracing for COVID@KU2020.
50 minutes assn5 work
time.
April
24 Class Parallel Coordinates Visualization, Data
Sonification & Weka Listening Research from 2015-2017.
See Parson, Hoch, Langley, and Malke
references above.
May
1 Class went over summer
2022 and summer
2023 cleaning & analyses of Hawk Mountain data.