CSC 458 - Data
Mining & Predictive Analytics I, Fall 2022, TuTh 4:30-5:50
PM.
Data Mining Effects of 50 Years of Climate Change at Hawk
Mountain Sanctuary
Thursday November 17 11-11:45 AM in Old
Main 158.
PowerPoint
slides here. PDF
slides are here. Here is the presentation
Zoom video.
Summer's Analysis
of Hawk Mountain Sanctuary Observation Data from 1976
through 2021.
Dr. Dale E. Parson, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458
OR See D2L Course CSC458 -> Content -> Overview for
the link.
Student
instructions for using Zoom.
IF
you don’t want to be recorded or are a minor, use
PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this
permission to record slip. I will use it to
take attendance in week 1.
Office Hours Monday 2-4, Wednesday 4-6 (Zoom only), Thursday
10-11 or by appt. All available via Zoom.
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
First day handout (syllabus that is
specific to this semester).
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I make
mistakes in pronunciation, name, gender, or anything else.
Thanks!
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student was
a minor at the time the incident occurred. Regardless of
the student’s age, if the incident is disclosed to me outside
the classroom setting or a course assignment, I am required by
law to report the disclosure, including relevant details, such
as the names of those involved in the incident, to Public Safety
and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. There will be a 10% deduction from a
homework assignment for repeated web surfing, web-based chatting
or other use of the Internet for activities unrelated to class
activities during both lectures and working sessions. During a
working session you may leave after completing and turning in
all due work; you are encouraged to stay to get additional
practice and ask questions. Thank you.
RESOURCES & HANDOUTS.
For students new to using our department's
Linux servers:
Please
log into acad or mcgonagall and run the following commands:
$ python -V
Python 3.7.7
$ ipython -V
7.14.0
If you
see earlier version numbers, edit a file called .bash_profile in
your login directory and add the following 2 lines at the top:
alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"
Log out,
log back in, and check the version numbers again. Let me know if
you run into problems.
Windows users can download the WinSCP file transfer
client in the Computer Science sub-menu below here.
Textbook: Data Mining: Practical
Machine Learning Tools and Techniques, Fourth
Edition, Witten et al., ISBN 978-0128042915. You can
buy a discounted copy of the 3rd Edition at the KU Book
Store -- either edition is fine. I have put a copy
of the 3rd edition of the textbook on reserve in Rohrbach
Library. You can go to the front desk & borrow it
overnight.
The
Graduate Assistant Tutor schedule is here.
If you are new to Python and have basic
questions, I have told the tutors that it is OK for you to ask them.
Of course you are encouraged to come to my
office hours in person or via Zoom.
I adapted from Kotu’s & Deshpande’s Predictive
Analytics and Data Mining: slides
for Linear Regression and M5P Trees, 10/30/2017.
There is an excellent on-line video course Predictive
Analytics Training with Weka (Introduction) by one of our
textbook authors & Weka creators.
Here is our
textbook's website. We will be using the Weka
tool set, which you can download
to your machine from here. (Download & install Weka
3.8.6)
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
I will draw some material from this
textbook as well.
The Weka download page has this note:
If your computer has a display that has a high pixel density,
and you are using Windows, Weka's user interfaces may not be
scaled appropriately and appear tiny. Installing Java 9 or later
solves this problem. Alternatively, in the Program menu of
Weka's GUIChooser, go into Settings, and select
WindowsLookAndFeel from the "Look and feel for UI" dropdown
menu. Some Weka packages currently do not work (properly) with
Java 9 or later (tigerJython and scatterPlot3D).
PYTHON.
How
to Think Like a Computer Scientist looks like a good
tutorial for Python 3.x newbies.
We will be using the 3.x version of Python.
Try running python -V to see that you are
getting Python 3.x.x as your default.
From the mcgonagall
machine (ssh mcgonagall from acad) do the following:
Edit a file called .bash_profile
in your login directory (create it if needed) and add these
3 lines near the top.
export PATH="/usr/local/bin:${PATH}"
alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"
Save the file
and exit the editor, log out and log back into mcgonagall.
Now type this:
python -V    # You should see this:
Python 3.7.7
If you
install python on your own machine, just running python will get
you the simpler-to-use interpreter.
I will
use ipython in lecture.
The Python website is at http://www.python.org/.
There is a good on-line tutorial and
reference by Steven F. Lott called Building Skills in Python.
There is a PDF copy here.
The IPython
site is here.
We will be using Python for data preparation in assignment 1.
We have Python installed on acad, but if you
want your own copy:
You can download Python 3.x
from here. Use the most recent stable 3.x for this course.
Documentation including tutorials for
the 3.x
library is here.
Here are my introductory slides on Python. We
will explore Python in class.
RESOURCES
The pythex utility for
testing Python regular expressions
D. Parson, 2022, Analysis
of Hawk Mountain Sanctuary Observation Data from 1976 through
2021
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial "Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information Science
Educators (PACISE) Edinboro
University of PA, Edinboro, PA, April 10-11, 2015. Here are the
slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators (PACISE)
Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here
are the slides
from the talk.
Textbook slides
Chapter 1
(week 1 - overview)
Chapter 2
(week 1 - input)
Chapter 3
(week 3 - output)
Chapter 4 (rules & trees week
5, linear models & model trees week 9, Bayesian inference
week 11, clustering week 12)
Compilation
of Weka slides on Instance Based Learning and Clustering
A graph on informational entropy,
relates to building rules & decision trees.
A page describing Bayes theorem and
related matters.
A Bayes computer
for a 52-card deck is on acad at
~parson/DataMine/BayesCards.py
BayesNet
examples from the textbook.
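As a small worked example of Bayes' theorem on a 52-card deck, here is a minimal sketch. It is not the BayesCards.py script on acad; the helper names (prob, is_king, is_face) are hypothetical and chosen only for illustration. It computes P(king | face card) two ways: from the prior, likelihood, and evidence, and by counting directly.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def always(card):
    """Condition that every card satisfies (no conditioning)."""
    return True


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in {"J", "Q", "K"}


def prob(event, given=always):
    """P(event | given), found by counting cards in a uniformly shuffled deck."""
    pool = [card for card in DECK if given(card)]
    hits = sum(1 for card in pool if event(card))
    return Fraction(hits, len(pool))


prior = prob(is_king)                      # P(king)
likelihood = prob(is_face, given=is_king)  # P(face | king)
evidence = prob(is_face)                   # P(face)

# Bayes' theorem: P(king | face) = P(face | king) * P(king) / P(face)
posterior = likelihood * prior / evidence
print(posterior)  # 1/3
```

Using Fraction keeps the arithmetic exact, so the Bayes result can be compared for equality with the direct count prob(is_king, given=is_face).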
Chapter 5 (5.1 - 5.5 week 8 -
evaluation)
Chapter 8
(week 6 - data transformations)
Chapter 12 on Ensemble Learning
We used
data from finalexam458fall2018.problem.zip
at ~parson/DataMine to demo ensemble learning on 4/13/2021.
Time-series
Data Analysis
I will draw some material from this
textbook as well.
Chapter 1 (overview)
Chapter
2 (overview)
Chapter 3A (data
exploration)
Chapter 3B (data
exploration)
Chapter 4A
(information-based learning)
Chapter 6A
(probability-based learning)
Chapter 6B
(probability-based learning)
Chapter 8A
(evaluation)
Chapter
8B (evaluation)
Appendix A (descriptive statistics
& data visualization)
Appendix B (introduction to
probability)
ASSIGNMENTS
Assignment
1, due 11:59 PM September 22 via make turnitin.
Red line added to
assignment spec on 9/22.
I added some per-STUDENT-requirement ipython
tutorial examples at that page's bottom on 9/4.
Added 9/12:
The comments for my findInName2Col(...) and
your findOutName2Col() incorrectly state:
map (Python dict) that maps the column number as an int
to the name of the attribute stripped of leading &
trailing blanks.
IT SHOULD SAY:
map (Python dict) that maps the name of the attribute
stripped of leading & trailing blanks to the column number
as an int.
My findInName2Col(...) is implemented correctly. Only
the comments are backwards.
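To illustrate the corrected direction of the map, here is a minimal sketch. It is not the assignment's findInName2Col(...) code; the function name build_name_to_col and the sample header are hypothetical.

```python
import csv
import io


def build_name_to_col(header_row):
    """Map each attribute name, stripped of leading and trailing blanks,
    to its column number as an int."""
    return {name.strip(): col for col, name in enumerate(header_row)}


# Hypothetical CSV text, not taken from the assignment data.
sample = io.StringIO(" year , month,count \n1976,9,12\n")
header = next(csv.reader(sample))
name_to_col = build_name_to_col(header)
print(name_to_col)  # {'year': 0, 'month': 1, 'count': 2}
```

With the map in this direction, name_to_col["month"] gives the column index to use when reading a data row.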
Also, the error messages disappear in a useful order if
you complete STUDENT requirements 5 & 6 before 3 & 4,
since function normalizePerMinMaxMutateInPlace(normdatarows,
noNormalizeColSet) that contains 3 & 4 runs after 5 &
6.
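For readers unfamiliar with min-max normalization, here is a minimal sketch of the general technique. It is an assumption about what such a function typically does, not the handout's normalizePerMinMaxMutateInPlace(...) and not a solution to the STUDENT requirements; the name min_max_normalize_in_place and the sample rows are hypothetical.

```python
def min_max_normalize_in_place(rows, skip_cols):
    """Rescale each numeric column of rows to the range [0.0, 1.0] in place,
    leaving the columns whose indices are in skip_cols untouched."""
    if not rows:
        return
    for col in range(len(rows[0])):
        if col in skip_cols:
            continue
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in rows:
            # A constant column has no range; map it to 0.0 instead of dividing by zero.
            row[col] = (row[col] - low) / span if span else 0.0


# Hypothetical rows: column 0 is a year that should not be normalized.
data = [[1976, 10.0, 5.0], [1998, 20.0, 5.0], [2021, 40.0, 5.0]]
min_max_normalize_in_place(data, skip_cols={0})
print(data)
```

Because the function mutates the lists it is given and returns nothing, the caller keeps using the same rows object afterwards.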
Assignment 2 on
numeric regression due 11:59 PM October 14 via D2L.
Assignment 3
on data compression & discrete classification is due
11:59 PM November 3 via D2L.
Parson's discussion
of the Kappa statistic. Parson's solution
to spring 2021 CSC458 classification problem as a
tutorial.
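As a companion to the Kappa discussion, here is a minimal sketch of Cohen's kappa computed from a confusion matrix, using the standard formula kappa = (observed - expected) / (1 - expected). The function name cohen_kappa and the example matrix are hypothetical and are not taken from the linked materials.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class result: 85 of 100 instances classified correctly.
matrix = [[40, 10],
          [5, 45]]
print(round(cohen_kappa(matrix), 3))  # 0.7
```

Here 85% of instances are correct, but chance agreement is 50%, so kappa is about 0.7 rather than 0.85.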
Assignment
4 on classification of nominal values and
time-series analysis is due by 11:59 PM on Friday November
25 via D2L.
My related
research paper from 2006. Here is one related book
and then
another one.
Assignment
5 is a redo of one of Assignments 2, 3, or 4,
using new regressors and/or classifiers with new
configuration parameters.
It is due via D2L by end of
Thursday December 15. Our "final exam" class on
12/15 at 2 PM will be a work session.
ZOOM ARCHIVES
September
1 class was an introductory overview of Python data &
control structures.
September
6 class Data
Analytics Certificate program, start Assignment
1 spec & code.
Reagan of CSC458 graciously agreed that we could record our office-hour
tour of PuTTY & Notepad++ on 9/7.
September
8 class we completed going over Assn1. Please start this
weekend as Tuesday will be a work session.
September
13 class was an assn1 work session, this video is 30 minutes
of Q&A starting with applying HINTS at the bottom of the
handout.
September
15 class certificate program, textbook slides chapter 1, and
some related examples from Hawk Mountain dataset.
September
20 class went over remaining slides from Chapter 1
then Chapter 2
and looked at related
Hawk Mountain models.
September
22 class went over Chapter 3 slides and first intro to
using Weka.
September
27 class went over handouts & example Weka usage for
Assignment 2. Thursday 9/29 will be a work session.
September
29 class brief Q&A during a work session.
October
4 class started going over prep materials beneath Assignment
3 above.
October
6 class went over remaining prep for Assignment 3, will send
it out within a week.
October
18 class went over my solution to Assignment 2, random
number generator seeds, and related.
October
20 class went over Assignment 3 handout & related topics
& Weka demo.
October
25 class went over Bayes / conditional probabilities as used
in model building.
October
27 class Naive Bayes assumption of statistical independence
of attributes, started ensemble model learning
November
1 class went over time-series data analysis and draft plans
for Assignment 4.
November 3 was a working session on Assignment 3.
November 8 class went over my Assignment 3 solution, Assignment 4
up to README Q1.
November
10 class more Assignment 4 including README then project
work session.
November
15 class went over Python cleaning & time-lagging script
used to prep Assignment 4.
November
17 class was my prerecorded video on Hawk Mtn. data analysis
from earlier in the day
plus some unrecorded Q&A after the video.
November 29 class (unlinked) Assignment 4 example solution,
handout Assignment 5, overview of Clustering.
December
6 class 3D and Parallel Attribute Graph
Visualizations of Assignment 4 Data
Parallel
Attributes book by Alfred
Inselberg. Some additional 5D attributes visualization.