CSC 458 - Data Mining & Predictive Analytics I, Spring 2021, TuTh 4:30-5:50 PM.
Classes are all via Zoom at class time. Zoom student docs are here.
To watch a recording you *may* first need to go here https://kutztown.zoom.us/
and Sign In using KU login.
FINAL EXAM TIME
Thursday, May 6, 2 p.m. – 4 p.m. Q&A
on Assignment 5, no exam.
My office hours will not change during finals,
same as usual.
Dr. Dale E. Parson, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458, or see D2L Course CSC458 -> Content -> Overview for the link.
IF you don’t want to be recorded or are a minor, use
PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.
The course is 100% via Zoom at class time. I will record &
post class videos, but want you there at class time. Thanks.
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 2-4, Wednesday 1-3, Thursday 10-11 or by
appt.
First day handout
(syllabus that is specific to this semester).
KU Campus Mask policy: Resident students must wear a mask
anytime they are outside of their personal room and within a
building or with anyone else but their roommate. Commuter
students must wear a mask anytime they are on campus within a
building or with anyone.
PA: The Secretary's Order requires individuals to wear a
face covering, in both indoor public places and in the outdoors
when they are not able to consistently maintain social
distancing from individuals who are not members of their
household, such as on a busy sidewalk, waiting in line to enter
a place, or near others at any place people are congregating.
Whether inside in a public place or outside, and when wearing a
face covering or not, everyone should socially distance at least
6 feet apart from others who are not part of your household.
HANDOUTS
Windows users can download the WinSCP file
transfer client in the Computer Science sub-menu
below here. Textbook: Data Mining: Practical
Machine Learning Tools and Techniques, Fourth Edition,
Witten, et al., ISBN 978-0128042915. You can buy a
discounted copy of the 3rd Edition at the KU Book Store --
either edition is fine. I have put a copy of the 3rd
edition of the textbook on reserve in Rohrbach Library. You can
go to the front desk & borrow it overnight.
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I make
mistakes in pronunciation, name, gender, or anything else.
Thanks! Here is a poll to which you can reply
privately on paper or via email.
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student was
a minor at the time the incident occurred. Regardless of
the student’s age, if the incident is disclosed to me outside
the classroom setting or a course assignment, I am required by
law to report the disclosure, including relevant details, such
as the names of those involved in the incident, to Public Safety
and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. There will be a 10% deduction from a
homework assignment for repeated web surfing, web-based chatting
or other use of the Internet for activities unrelated to class
activities during both lectures and working sessions. During a
working session you may leave after completing and turning in
all due work; you are encouraged to stay to get additional
practice and ask questions. Thank you.
RESOURCES & HANDOUTS. We will use research results
published by students & me to discuss various topics.
Here are textbook
slides from Kotu’s & Deshpande’s Predictive
Analytics and Data Mining: Concepts and Practice using
RapidMiner.
I found this book at the start of the semester and will
probably use it next time I teach the course.
It is at a more appropriate level for the
course, but the slides from all 3 textbooks cited on this page
stink.
However, I will use the slides from this
book, with my own additions, since they are generally better.
I adapted slides
for Linear Regression and M5P Trees, 10/30/2017.
There is an excellent on-line video course Predictive
Analytics Training with Weka (Introduction) by one of our
textbook authors & Weka creators.
Here is our
textbook's website. We will be using the Weka
tool set, which you can download
to your machine from here.
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
I will draw some material from this
textbook as well.
The Weka download page has this note:
If your computer has a display that has a high pixel density,
and you are using Windows, Weka's user interfaces may not be
scaled appropriately and appear tiny. Installing Java 9 or later
solves this problem. Alternatively, in the Program menu of
Weka's GUIChooser, go into Settings, and select
WindowsLookAndFeel from the "Look and feel for UI" dropdown
menu. Some Weka packages currently do not work (properly) with
Java 9 or later (tigerJython and scatterPlot3D).
PYTHON.
How
to Think Like a Computer Scientist looks like a good
tutorial for Python 3.x newbies.
Dr.
Schwesinger has posted some additional Python textbooks
for CSC223.
We will be using the 3.x version of Python.
Try running python -V to see that you are
getting Python 3.x.x as your default.
From the mcgonagall machine (ssh mcgonagall from acad) do the following:
Edit a file called .bash_profile in your login directory (create it if needed) and add these 3 lines near the top.
export PATH="/usr/local/bin:${PATH}"
alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"
Save the file and exit the editor, log out and log back into mcgonagall.
Now type this:
python -V
You should see this:
Python 3.7.7
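The same check can be made from inside a script. This is a minimal sketch using only the standard library; the function name is_python3 is hypothetical and not part of any course handout.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3

if __name__ == "__main__":
    # Prints the same kind of version string that `python -V` reports.
    print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
    if not is_python3():
        print("Warning: this course uses Python 3.x")
```

If the warning prints on mcgonagall, the .bash_profile alias above is probably not being picked up.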
If you
install python on your own machine, just running python will get
you the simpler-to-use interpreter.
I will
use ipython in lecture.
The Python website is at http://www.python.org/.
There is a good on-line tutorial and
reference by Steven F. Lott called Building Skills in Python.
There is a PDF copy here.
The IPython
site is here.
We will be using Python for data preparation in assignment 1.
We have Python installed on acad, but if you
want your own copy:
You can download Python 3.x
from here. Use the most recent stable 3.x for this course.
Documentation including tutorials for
the 3.x
library is here.
Here are my introductory
slides on Python. We will explore Python in class.
Using Notepad++: Go to Settings->Preferences...->Language
(since version 7.1) or Settings->Preferences...->Tab
Settings (previous versions)
Check Replace by space
To convert existing tabs to spaces, press
Edit->Blank Operations->TAB to Space.
If you are a vim editor user,
create a file called .vimrc in your login directory with
the following lines:
set ai
set ts=4
set sw=4
set expandtab
set sta
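Both editor setups above expand tabs to spaces because Python 3 rejects inconsistent mixing of tabs and spaces in indentation. Here is a small sketch for checking a source file's text, assuming 4-column tab stops to match the settings above; the helper names are hypothetical.

```python
def lines_with_tabs(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad

def tabs_to_spaces(text, tabsize=4):
    """Expand every tab to spaces using str.expandtabs."""
    return text.expandtabs(tabsize)

sample = "if True:\n\tprint('tab indented')\n    print('space indented')\n"
print(lines_with_tabs(sample))                  # [2]
print(lines_with_tabs(tabs_to_spaces(sample)))  # []
```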
RESOURCES
The pythex utility for
testing Python regular expressions
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial "Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information Science
Educators (PACISE) Edinboro
University of PA, Edinboro, PA, April 10-11, 2015. Here are the
slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators (PACISE)
Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here
are the slides
from the talk.
Textbook slides
Chapter 1 (week 1 - overview)
Chapter 2 (week 1 - input)
Chapter 3 (week 3 - output)
Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)
A graph on informational entropy, relates to building rules & decision trees.
A page describing Bayes theorem and related matters.
A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
BayesNet examples from the textbook.
Chapter 5 (5.1 - 5.5 week 8 - evaluation)
A draft discussion of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) applied to nominal classification.
Chapter 8 (week 6 - data transformations)
Chapter 12 on Ensemble Learning for Spring 2021
We will use data from finalexam458fall2018.problem.zip at ~parson/DataMine to demo ensemble learning on 4/13.
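The entropy graph listed under Chapter 4 above relates to how rule and tree learners pick attributes to split on. As a rough illustration (not taken from the slides), here is a sketch of Shannon entropy and information gain over class labels; the function names and the toy labels are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a non-empty sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into the given `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))  # 1.0, an even two-class mix is one full bit
# A split that separates the classes perfectly recovers that whole bit.
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))
```

A split that leaves each group as mixed as the original has a gain of zero, which is why tree builders prefer the attribute with the largest gain.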
I will draw some material from this
textbook as well.
Chapter 1 (overview)
Chapter 2 (overview)
Chapter 3A (data exploration)
Chapter 3B (data exploration)
Chapter 4A (information-based learning)
Chapter 6A (probability-based learning)
Chapter 6B (probability-based learning)
Chapter 8A (evaluation)
Chapter 8B (evaluation)
Appendix A (descriptive statistics & data visualization)
Appendix B (introduction to probability)
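For the probability material above, Bayes theorem can be checked by brute-force counting on a 52-card deck. This sketch is only an illustration written for this page; it is not the BayesCards.py script on acad, and the names prob, is_king and is_face are hypothetical.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def prob(event, given=lambda card: True):
    """P(event | given), computed by counting cards in the deck."""
    pool = [card for card in DECK if given(card)]
    hits = [card for card in pool if event(card)]
    return Fraction(len(hits), len(pool))

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in ("J", "Q", "K")

# Bayes theorem: P(king | face) = P(face | king) * P(king) / P(face)
by_bayes = prob(is_face, given=is_king) * prob(is_king) / prob(is_face)
by_counting = prob(is_king, given=is_face)
print(by_bayes, by_counting)  # 1/3 1/3
```

Using Fraction keeps the arithmetic exact, so the two routes to the conditional probability can be compared with ==.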
ASSIGNMENTS
Assignment 1
From acad you must "ssh mcgonagall" to test
Assignment 1.
Assignment 1 uses Python regular expressions to parse a textual data file and write it to a comma-separated value file.
It is due by 11:59 PM on Saturday February 13 via make turnitin.
https://pythex.org/ is a very valuable interactive tool and https://docs.python.org/3/library/re.html is library documentation.
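As a general illustration of the technique (not the Assignment 1 solution), here is a sketch that uses the re and csv modules to turn text records into CSV. The record layout "id=... temp=... site=..." is invented for this example; the actual assignment data format is in the handout.

```python
import csv
import io
import re

# Hypothetical record layout, e.g. "id=17 temp=21.5 site=north".
RECORD = re.compile(
    r"id=(?P<id>\d+)\s+temp=(?P<temp>-?\d+(?:\.\d+)?)\s+site=(?P<site>\w+)"
)

def text_to_csv(text):
    """Parse the lines of `text` that match RECORD and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "temp", "site"])  # header row
    for line in text.splitlines():
        match = RECORD.search(line)
        if match:  # skip lines that do not fit the pattern
            writer.writerow([match.group("id"),
                             match.group("temp"),
                             match.group("site")])
    return out.getvalue()

sample = "id=17 temp=21.5 site=north\n# comment line\nid=18 temp=-3 site=south\n"
print(text_to_csv(sample))
```

Named groups such as (?P<temp>...) make the pattern easier to debug in pythex than numbered groups.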
Assignment 2 is
due via D2L Assignment 2 web page by 11:59 PM on Saturday
March 13.
Here is fall
2019's Assignment 2 with answers as sample for
Feb. 16 & 18 class, my page on interpreting Kappa
statistic.
The fall
2019 Assignment 3 solution Figs. 1-3 show a custom step
function for discretizing the BW counts.
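For readers unfamiliar with the idea, a step function for discretizing maps each numeric value to the bin whose interval contains it. The sketch below shows one generic way to write that in Python; the cut points and bin names are invented for illustration and are not the ones in the fall 2019 solution.

```python
from bisect import bisect_right

def make_step_discretizer(cut_points, labels):
    """Return a function mapping a number to the label of its interval.

    cut_points must be sorted ascending; labels needs len(cut_points) + 1
    entries, one per interval. A value equal to a cut point falls in the
    interval that starts at that cut point.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need one more label than cut points")

    def discretize(value):
        return labels[bisect_right(cut_points, value)]

    return discretize

# Hypothetical cut points and bin names, chosen only for illustration.
to_bin = make_step_discretizer([1, 10, 100], ["none", "few", "some", "many"])
print([to_bin(v) for v in (0, 1, 9, 10, 250)])
# ['none', 'few', 'few', 'some', 'many']
```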
Here is the handout JoinedHawkMtn20172018.arff
file & the student-edited FilteredCSC458assn2.arff
from that project.
Download
Weka 3.8.5 (the latest stable version) to work on your own
machine. There is a Windows copy on the campus network.
Here is my page on interpreting
the Kappa statistic needed for Assignment 2.
Here are my answers to
Assignment 2.
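As a companion to the Kappa page above, here is a sketch of Cohen's kappa computed from a confusion matrix: kappa = (observed agreement - chance agreement) / (1 - chance agreement). The function name and the example matrix are made up for illustration.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Rows are actual classes, columns are predicted classes. Undefined
    (division by zero) when chance agreement is exactly 1.
    """
    total = sum(sum(row) for row in confusion)
    size = len(confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

matrix = [[40, 10],
          [5, 45]]
# 85% of instances are on the diagonal, and the row and column totals
# give 50% agreement by chance, so kappa = 0.35 / 0.5.
print(round(kappa(matrix), 3))  # 0.7
```

A kappa near 0 means the classifier does no better than guessing from the class distribution, even when percent correct looks high.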
Assignment 3 is
due via D2L Assignment 3 web page by 11:59 PM on Friday
April 2.
Here are my
answers to Assignment 3.
Assignment
4 is due via D2L Assignment 4 by 11:59 PM on
Friday April 23.
Here are my answers to
Assignment 4.
Assignment
5 is due via D2L Assignment 5 by 11:59 PM on
Sunday May 9.
Chapter 12 on Ensemble Learning
for Spring 2021 in PDF format for reference on ensemble
classifiers.
See Bagging &
Boosting pseudocode on pages 7 & 13 of these slides.
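For orientation only, here is a rough Python sketch of the general bagging idea: train several models on bootstrap samples and take a majority vote. It is not a transcription of the slides' pseudocode or of Weka's Bagging; the one-attribute threshold "stump" base learner, the function names and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-attribute threshold classifier to (x, label) rows.

    Tries each observed x as a threshold and keeps the one with the fewest
    training errors. A stand-in base learner for illustration.
    """
    best = None
    for threshold, _ in rows:
        left = [label for x, label in rows if x <= threshold]
        right = [label for x, label in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best

    def classify(x):
        return left_label if x <= threshold else right_label

    return classify

def bagging(rows, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # draw with replacement
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
predict = bagging(data)
print(predict(0), predict(10))
```

Each bootstrap sample has the same size as the training set but repeats some rows and omits others, which is what makes the individual models differ.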
I will not accept assignments after 9 AM
Monday May 10. Please do not procrastinate.
This is in place of a final exam. I will
not answer questions except in class & the final exam
period.
I will not answer technical questions. We
have already covered the technical concepts in this
assignment.
I will only clarify confusing questions
& fix ambiguous text.
Our final exam period is Thursday, May
6, 2 p.m. – 4 p.m.
Use command line on Mac to increase memory, do not use
the .app at all:
ku135515parson:~ parson$ alias
alias weka='java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar'
alias wekanew='java -server -Xmx4000M -jar /Applications/weka-3-8-2/weka.jar'
ku135515parson:~ parson$ ls -ld /Applications/weka*
drwxr-xr-x@ 16 parson admin 544 Apr 13 2016 /Applications/weka-3-8-0
drwxr-xr-x@ 3 parson admin 102 Apr 13 2016 /Applications/weka-3-8-0-oracle-jvm.app
drwxr-xr-x@ 16 parson admin 544 Dec 21 2017 /Applications/weka-3-8-2
drwxr-xr-x@ 3 parson admin 102 Dec 21 2017 /Applications/weka-3-8-2-oracle-jvm.app
ku135515parson:~ parson$ ls -l /Applications/weka-3-8-0
total 85240
-rw-r--r--@ 1 parson admin 35147 Apr 13 2016 COPYING
-rw-r--r--@ 1 parson admin 16171 Apr 13 2016 README
-rw-r--r--@ 1 parson admin 6621937 Apr 13 2016 WekaManual.pdf
drwxr-xr-x@ 57 parson admin 1938 Apr 13 2016 changelogs
drwxr-xr-x@ 27 parson admin 918 Apr 13 2016 data
drwxr-xr-x@ 17 parson admin 578 Apr 13 2016 doc
-rw-r--r--@ 1 parson admin 510 Apr 13 2016 documentation.css
-rw-r--r--@ 1 parson admin 1863 Apr 13 2016 documentation.html
-rw-r--r--@ 1 parson admin 42900 Apr 13 2016 remoteExperimentServer.jar
-rw-r--r--@ 1 parson admin 10759024 Apr 13 2016 weka-src.jar
-rw-r--r--@ 1 parson admin 30414 Apr 13 2016 weka.gif
-rw-r--r--@ 1 parson admin 359270 Apr 13 2016 weka.ico
-rw-r--r--@ 1 parson admin 10997325 Apr 13 2016 weka.jar
-rw-r--r--@ 1 parson admin 14758799 Apr 13 2016 wekaexamples.zip
Slides on Minimum
Description Length and Evaluating Numeric
Prediction,
for the week of November 6, in
preparation for Assignment 3.
Read Sections 5.8 and 5.9 in the 3rd Edition
textbook, 5.9 and 5.10 in the 4th Edition,
on Evaluating Numeric
Prediction and the Minimum Description Length
principle.
Read textbook sections on
linear regression & M5P model trees to reinforce previous
lecture material.
Here is an IBM
site overview of interpreting linear regression.
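To make the numeric-prediction measures concrete, here is a sketch of mean absolute error, root mean squared error and the (Pearson) correlation coefficient in plain Python. The function names and the toy numbers are invented for this example; Weka's own output should be treated as the reference.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalizes large misses more."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]   # exactly 1.5 * actual - 2.5
print(mean_absolute_error(actual, predicted))      # 1.0
print(root_mean_squared_error(actual, predicted))  # sqrt(1.5), about 1.22
print(correlation_coefficient(actual, predicted))  # 1.0
```

The toy predictions are a perfect linear function of the actual values, so the correlation is 1.0 even though the errors are not zero. Correlation measures how well predictions track the target, not how close they are.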
ZOOM ARCHIVES
January 19, Intro to course and a survey of past data mining projects at KU.
January 21, Went through Chapter 1 slides on overview and surveyed projects done at KU.
January 26, Went over basic Python if-while-for control constructs and container types. Assn1 TBD Thursday Jan 28.
January 28, Went over Assignment 1 and use of https://pythex.org/ for interactive regular expression debugging.
February 1, Office hours detailed steps for using mcgonagall & completing all requirements in Assignment 1.
February 2, Q&A for Assignment 1, will start new material February 4.
February 4, Went over slides for Chapter 2 and related this back to past data analysis projects at KU.
February 9, Started Chapter 3 slides through Decision Trees, some Weka demo. CARESTEM458B.arff.txt is the Weka data file demoed in class. Save it as CARESTEM458B.arff for Weka use.
February 16 (link removed) Assignment 1 debriefing, lead up to fall 2019 Assignment 2 walk through, will continue 2/18.
February 18, Walked through start of Fall 2019 Assignment 2 up to Discretize w. UseEqualFrequency on page 9.
February 23, Finished walk through of Fall 2019 Assignment 2, examined kappa statistic.
February 25, Went over Assignment 2 handout above.
March 2, Weka nominal classification for this analysis and this analysis of programming student behavior-to-grade.
March 4, Weka numeric classification demo of March 2 dataset focusing on Linear Regression & M5P model trees.
March 9, Demoed pros & cons of normalization of attributes into the range [0.0, 1.0] for regression, went over correlation coefficient and other evaluation metrics including kappa, last 45 minutes were assn2 Q&A.
March 11, Personal day for me, no class, please watch July 2020 video of simulation COVID@KU for data analysis.
March 16, Went over my 4 solutions to Assignment 2 & expanded on discretization and over-fitting discussions.
March 18, Went over Assignment 3 handout.
March 23, Went over discretization and other preprocessing, filtering, & derived attributes, Chapter 8 slides.
March 25, Went over basic statistics, unconditional and conditional (Bayesian) probabilities. My BayesCards.py script from acad ~parson/DataMine/BayesCards.py. A screen shot from the Zoom recording.
March 30, Went over Naive Bayes examples from this paper & interactive Weka and my BayesCards.py script.
April 1, Work Q&A session not recorded.
April 6, Went over Parson solution to Assignment 3, handed out Assignment 4.
April 8, Went over new Assignment 4 and looked at Naive Bayes & Bayes Net tables.
April 13, started Chapter 12 slides Ensemble Learning including Bagging & RandomForest demos in Weka.
April 15, finish discussion of Ensemble Learning in preparation for Assignment 5.
April 22, Bryan McNally presentation on Hawk Mtn climate-flight height correlation, partial overview of Assn5.
April 27, Parson's solution to Assignment 4, start Assignment 5.
April 29, Completed going over handouts & Q&A for Assignment 5.
May 9, Some Q&A about Assignment 5 from our final "exam" class.
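The March 9 session above covered normalizing attributes into the range [0.0, 1.0]. For reference, a minimal sketch of that min-max rescaling in Python; the function name is hypothetical and the handling of a constant attribute is a choice made for this example, not necessarily what Weka's Normalize filter does.

```python
def normalize(values):
    """Rescale numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # Constant attribute: no spread to rescale (assumed behavior).
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

print(normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```

Because the scaling depends on the minimum and maximum, a single outlier squeezes every other value toward one end of the range, which is one of the cons to weigh.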