CSC 458 - Data Mining and Predictive Analytics I, Fall 2019
Sept 21, 2019 Field Trip
to Hawk Mountain, be at Visitor Center by 9 AM. (Photo
added 10/6/2019.)
The Weka download page has this note:
If your computer has a
display that has a high pixel density, and you are using
Windows, Weka's user interfaces may not be scaled
appropriately and appear tiny. Installing Java 9 or
later solves this problem. Alternatively, in the Program
menu of Weka's GUIChooser, go into Settings, and select
WindowsLookAndFeel from the "Look and feel for UI"
dropdown menu. Some Weka packages currently do not work
(properly) with Java 9 or later (tigerJython and
scatterPlot3D).
CSC480 Special topics course
in spring 2020:
This course increases breadth and depth of knowledge for
students with experience in object-oriented programming
for multimedia systems. Advanced topics include working
with camera point-of-view and lighting sources for 3D
graphics, recursive shapes and fractals, pixel-level
image processing, and animated video composition.
Students will program graphical images, video streams,
audio signals, physical devices containing electronic
sensors and effectors, and combinations of these media.
There will be solo and team programming projects.
Prerequisites: CSC220 with a grade of C or better.
(Presumably that prerequisite should have included "or
unconditional admission to the Graduate program.")
First day handout
(syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools
and Techniques, Fourth Edition, Witten et al., ISBN
978-0128042915. You can buy a discounted copy of the 3rd
Edition at the KU Book Store -- either edition is fine.
I have put a copy of the 3rd edition of the textbook on
reserve in Rohrbach Library. You can go to the front desk
& borrow it overnight.
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I
make mistakes in pronunciation, name, gender, or anything
else. Thanks! Here is a poll to which you
can reply privately on paper or via email.
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student
was a minor at the time the incident occurred.
Regardless of the student’s age, if the incident is disclosed
to me outside the classroom setting or a course assignment, I
am required by law to report the disclosure, including
relevant details, such as the names of those involved in the
incident, to Public Safety and Police Services and to Mr.
Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date. There will be a 10% deduction from a
homework assignment for repeated web surfing, web-based
chatting or other use of the Internet for activities unrelated
to class activities during both lectures and working sessions.
During a working session you may leave after completing and
turning in all due work; you are encouraged to stay to get
additional practice and ask questions. Thank you.
RESOURCES & HANDOUTS. We will use research results
published by students & me to discuss various topics.
Here are textbook
slides from Kotu’s & Deshpande’s Predictive
Analytics and Data Mining: Concepts and Practice using
RapidMiner.
I found this book at the start of the semester and
will probably use it the next time I teach the course.
It is at a more appropriate level for the
course, but the slides from all 3 textbooks cited on this page
stink.
I will nevertheless use the slides from this
book, with my own additions, since they are generally better.
I adapted slides for Linear
Regression and M5P Trees, 10/30/2017.
There is an excellent on-line video course Predictive
Analytics Training with Weka (Introduction) by one of
our textbook authors & Weka creators.
Here is our
textbook's website. We will be using the
Weka tool set, which you can download
to your machine from here.
The PDF
Appendix to our textbook is here. It is a 128-page
tutorial on using Weka. Here is the Weka Wiki.
Additional
Weka documentation is here.
I will draw some material from
this textbook as well.
PYTHON.
How
to Think Like a Computer Scientist looks like a good
tutorial for Python newbies.
Dr.
Schwesinger has posted some additional Python textbooks
for CSC223.
We will be using the 2.x version of Python,
although assignment 1 is compatible with both 3.x and 2.x.
Try running python -V to see that you are
getting Python 2.6.x or 2.7.x as your default.
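The same version check can be made from inside a script. This is a minimal sketch that uses only the standard sys module; the helper name describe_python is hypothetical and does not come from any course handout. It runs under both 2.x and 3.x.

```python
from __future__ import print_function
import sys


def describe_python(version_info=None):
    """Return (label, is_two) for a version tuple such as (2, 7, 18).

    With no argument, the running interpreter's sys.version_info is used.
    """
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    label = "Python %d.%d" % (major, minor)
    return label, major == 2


if __name__ == "__main__":
    label, is_two = describe_python()
    # Report whether the default interpreter is a 2.x release.
    print(label, "(2.x)" if is_two else "(not 2.x)")
```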
From the acad
machine do the following actions in bold:
Edit a file called .bash_profile
in your login directory (create it if needed) and add
these 2 lines near the top.
alias python2="/usr/bin/python"
Save the file and
exit the editor.
Now type this:
python2    # type this to get the basic Python interpreter.
You
should see the interactive python interpreter that we will go
over in class.
If
you install python on your own machine, just running python
will get you the simpler-to-use interpreter.
The Python website is at http://www.python.org/.
There is a good on-line tutorial and
reference by Steven F. Lott called Building Skills in
Python. There
is a PDF copy here.
The IPython
site is here.
We will be using Python for data preparation in assignment 1.
We have Python installed on acad, but if
you want your own copy:
You can download Python 2.x
or 3.x from here. Use the most recent stable 2.x for
this course.
Documentation including tutorials
for the 2.x
library is here, for 3.x is here.
Here are my introductory
slides on Python. We will explore Python in class, so
please attend in person or via RTVC.
Using Notepad++: Go to Settings->Preferences...->Language
(since version 7.1) or Settings->Preferences...->Tab
Settings (previous versions).
Check Replace by space.
To convert existing tabs to spaces, press
Edit->Blank Operations->TAB to Space.
If you are a vim editor user,
create a file called .vimrc in your login directory
with the following lines:
set ai
set ts=4
set sw=4
set expandtab
set sta
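The editor settings above keep new tabs out of Python source, where mixed tabs and spaces cause indentation errors. For a file that already contains tabs, Python itself can find and expand them. This is a minimal sketch; the function names are hypothetical, and the 4-column tab stop is an assumption chosen to match the ts=4 and sw=4 settings above.

```python
from __future__ import print_function


def lines_with_tabs(text):
    """Return the 1-based numbers of the lines that contain a tab character."""
    return [n for n, line in enumerate(text.split("\n"), 1) if "\t" in line]


def tabs_to_spaces(text, tabsize=4):
    """Expand each tab to spaces, using tab stops every tabsize columns."""
    return "\n".join(line.expandtabs(tabsize) for line in text.split("\n"))


if __name__ == "__main__":
    sample = "if x:\n\tprint(x)\n"
    print("lines with tabs:", lines_with_tabs(sample))
    print(tabs_to_spaces(sample))
```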
RESOURCES
The pythex utility for
testing Python regular expressions
D. Parson and A. Seidel, "Mining
Student Time Management Patterns in Programming Projects,"
Proceedings of FECS'14:
2014 Intl. Conf. on Frontiers in CS & CE Education,
Las Vegas, NV, July 21 - 24, 2014. Here are the slides
for the talk and the outline for the follow-up tutorial
"Using
Weka to Mine Temporal Work Patterns of Programming Students."
D. Parson, L. Bogumil & A. Seidel, "Data
Mining Temporal Work Patterns of Programming Student
Populations," Proceedings of the 30th Annual Spring
Conference of the Pennsylvania Computer and Information
Science Educators (PACISE)
Edinboro University of PA, Edinboro, PA, April 10-11, 2015.
Here are the slides
from the talk.
D. Parson, D. E. Hoch & H. Langley, "Timbral
Data Sonification from Parallel Attribute Graphs,"
Proceedings of the 31st Annual Spring Conference of the
Pennsylvania Computer and Information Science Educators
(PACISE) Kutztown University of PA, Kutztown, PA, April 1-2,
2016. Here are the slides
from the talk.
Textbook slides
Chapter 1 (week 1 - overview)
Chapter 2 (week 1 - input)
Chapter 3 (week 3 - output)
Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)
A graph on informational entropy, relates to building rules & decision trees.
A page describing Bayes theorem and related matters.
A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
BayesNet examples from the textbook.
Chapter 5 (5.1 - 5.5 week 8 - evaluation)
Chapter 8 (week 6 - data transformations)
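As background for the entropy graph listed under Chapter 4, here is a minimal sketch of how entropy and information gain are computed from class counts when building a decision tree. The function names are hypothetical. The counts in the usage lines are the play=yes / play=no totals of the small nominal weather dataset that ships with Weka, split on the outlook attribute.

```python
from __future__ import division, print_function
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    bits = 0.0
    for count in counts:
        if count > 0:                      # an empty class contributes 0 bits
            p = count / total
            bits -= p * math.log(p, 2)
    return bits


def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the weighted entropy of its children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in child_counts_list)
    return entropy(parent_counts) - remainder


if __name__ == "__main__":
    # 9 yes / 5 no overall; outlook = sunny, overcast, rainy
    print(entropy([9, 5]))
    print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))
```

For these counts the entropy is about 0.940 bits and the gain from splitting on outlook is about 0.247 bits. A tree learner that uses this criterion prefers the attribute with the largest gain.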
I will draw some material from
this textbook as well.
Chapter 1 (overview)
Chapter 2 (overview)
Chapter 3A (data exploration)
Chapter 3B (data exploration)
Chapter 4A (information-based learning)
Chapter 6A (probability-based learning)
Chapter 6B (probability-based learning)
Chapter 8A (evaluation)
Chapter 8B (evaluation)
Appendix A (descriptive statistics & data visualization)
Appendix B (introduction to probability)
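As a companion to the probability material and the Bayes theorem page listed above, here is a minimal worked example of Bayes theorem on a standard 52-card deck. It is an independent sketch, not the contents of BayesCards.py; the function name bayes and the choice of events (king, face card) are assumptions made for illustration.

```python
from __future__ import print_function
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Events in a standard 52-card deck.
p_king = Fraction(4, 52)            # 4 kings
p_face = Fraction(12, 52)           # 4 jacks + 4 queens + 4 kings
p_face_given_king = Fraction(1)     # every king is a face card

p_king_given_face = bayes(p_face_given_king, p_king, p_face)
print(p_king_given_face)            # prints 1/3
```

Exact fractions are used instead of floats so the result can be checked by hand: 4 of the 12 face cards are kings.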
ASSIGNMENTS
Assignment 1
Assignment
1, DUE 11:59 PM on Friday September 27 via "make
turnitin".
Here is the README.txt file and the example weatherToARFF2019.py.txt
from 2nd half of 9/11 class (not Zoomed).
Zoom
video (Aug 21 on Mac) on capturing weather data via
browser->Excel->CSV file, worth 20% of Assignment 1 (25
minutes).
Comments from 9/11 work session for
students working with CSV files having flaky Weather
Underground data fields:
I created notes from the 2nd half of
the 9/11 class that I extended 9/12 morning on acad in
~parson/DataMine/csc458fall2019assn1/README.txt. If you have
a set of Weather Underground CSV files that do not have
flaky fields, then following the handout instructions for
checking for blank fields or invalid units of measure will
work fine. If 'make test' already works for you, AND if you
have the checks for blank fields or invalid units of
measure, then you have completed the assignment. 'make test'
by itself is not enough, since the handout comments require
checking for blank fields or invalid units of measure.
HOWEVER, if you get a CSV file with flaky
data (one turned up last night), those instructions are not
enough. 'make teststudent' will fail due to setting ISERROR
to True, when your script exits. Therefore, do not set
ISERROR to True. Instead, write a message to sys.stderr and
set the output field to '?'. Again, if 'make test' already
works for you, AND if you have the checks for blank fields
or invalid units of measure, then you have completed the
assignment, and you don't need to bother with data-error
recovery. Your wunderground data was good. I highly
recommend reading all of
~parson/DataMine/csc458fall2019assn1/README.txt, regardless,
to understand the issues.
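The recovery approach described above (write a message to sys.stderr and emit '?' rather than setting ISERROR) can be sketched as follows. This is an illustration under stated assumptions, not the assignment solution: the function name clean_measure, the row_number argument, and the "71 F" field layout (a number followed by a unit of measure) are all hypothetical. The '?' itself is the ARFF marker for an unknown value.

```python
from __future__ import print_function
import sys

MISSING = '?'   # ARFF marker for an unknown value


def clean_measure(raw, unit, row_number):
    """Return the numeric text of a field such as '71 F', or '?' if unusable.

    A blank field, a wrong unit of measure, or a non-numeric value is
    reported on sys.stderr and replaced by '?', so the script keeps running.
    """
    text = raw.strip()
    if not text:
        sys.stderr.write("row %d: blank field\n" % row_number)
        return MISSING
    if not text.endswith(unit):
        sys.stderr.write("row %d: expected unit %r in %r\n"
                         % (row_number, unit, text))
        return MISSING
    number = text[:-len(unit)].strip()
    try:
        float(number)                      # validate only; keep original text
    except ValueError:
        sys.stderr.write("row %d: non-numeric value %r\n" % (row_number, text))
        return MISSING
    return number


if __name__ == "__main__":
    for row, field in enumerate(["71 F", "", "29.9 in", "-- F"], 1):
        print(clean_measure(field, "F", row))
```

Because the bad field becomes '?', Weka treats it as a missing value for that one instance instead of the whole conversion failing.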
On 9/18 we will start new material on analyzing data
in ARFF files using Weka.
Assignment 2
Assignment
2 is due by 11:59 PM on October 19 via make turnitin.
Assignment
2 preview page with sample datasets. A
visualization of bird migration in Europe discovered by
Bryan McNally.
Here is my
page on interpreting the Kappa statistic needed for
Assignment 2. (UPDATED Oct 1, 2019)
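For reference, the Kappa statistic that Weka prints with its classifier output can be reproduced from a confusion matrix. This is a minimal sketch of the standard Cohen's kappa formula, (observed agreement - chance agreement) / (1 - chance agreement); the function name and the example matrix are made up for illustration.

```python
from __future__ import division, print_function


def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Fraction of instances on the diagonal: observed agreement.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
                   for i in range(size)) / (total * total)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # 70% of instances correct, 50% agreement expected by chance.
    print(kappa([[20, 5], [10, 15]]))
```

In this example 70% accuracy gives a kappa of only 0.4, because half of that agreement would be expected by chance. The formula is undefined when chance agreement is exactly 1, for example when every instance falls in one cell.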
Assignment 3
Assignment
3 on using linear models to predict numeric attributes
is due by 11:59 PM on Wednesday November 13.
Here are last
year's answers to their assignment 3. We will start
going over this October 9.
If you get this
warning in assignments when using a Supplied Test Set
(external .arff file) with more attributes than the Training
Set,
just answer "yes" to the
warning dialog. If the training set is a strict subset of the
test set with respect to attributes,
where the attribute
names and types are identical in both .arff files, this
auto-mapping works fine.
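The subset condition above can be checked before loading the files into Weka. This is a minimal sketch, assuming simple ARFF headers in which attribute names are unquoted and contain no spaces; the function names are hypothetical, and types are compared as exact text.

```python
from __future__ import print_function


def read_attributes(arff_lines):
    """Map attribute name -> type text from the header lines of an ARFF file."""
    attributes = {}
    for line in arff_lines:
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break                           # the header ends at @data
        if lowered.startswith("@attribute"):
            parts = stripped.split(None, 2) # keyword, name, type
            if len(parts) == 3:
                attributes[parts[1]] = parts[2].strip()
    return attributes


def unmapped_attributes(train_lines, test_lines):
    """Training attributes that are missing from, or typed differently in, the test set."""
    train = read_attributes(train_lines)
    test = read_attributes(test_lines)
    return sorted(name for name, kind in train.items()
                  if test.get(name) != kind)


if __name__ == "__main__":
    train = ["@relation train", "@attribute temp numeric",
             "@attribute play {yes,no}", "@data"]
    test = ["@relation test", "@attribute temp numeric",
            "@attribute humidity numeric", "@attribute play {yes,no}", "@data"]
    print(unmapped_attributes(train, test))
```

An empty result means every training attribute appears in the test set with the same name and type, which is the case in which answering "yes" is described as safe above.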
Assignment 4
Assignment
4 uses various models, including Bayesian models, to consider
the importance of using an additional weather station.
Assignment 4 is due by 11:59 PM on
Wednesday December 4 via make turnitin.
I will NOT accept solutions to this
Assignment 4 after 9 AM on Friday December 6. I need to
turn back my solution.
Assignment 5
Assignment
5 is a cumulative, take-home exam project. It
is due by 11:59 PM on Wednesday December 11 via make
turnitin.
I will NOT accept
solutions to this Assignment 5 after noon on Thursday
December 12.
Please read the RULES
FOR THE FINAL in the handout.
Use the command line on a Mac to increase Weka's memory; do not use
the .app at all:
ku135515parson:~ parson$ alias
alias weka='java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar'
alias wekanew='java -server -Xmx4000M -jar /Applications/weka-3-8-2/weka.jar'
ku135515parson:~ parson$ ls -ld /Applications/weka*
drwxr-xr-x@ 16 parson admin 544 Apr 13 2016 /Applications/weka-3-8-0
drwxr-xr-x@  3 parson admin 102 Apr 13 2016 /Applications/weka-3-8-0-oracle-jvm.app
drwxr-xr-x@ 16 parson admin 544 Dec 21 2017 /Applications/weka-3-8-2
drwxr-xr-x@  3 parson admin 102 Dec 21 2017 /Applications/weka-3-8-2-oracle-jvm.app
ku135515parson:~ parson$ ls -l /Applications/weka-3-8-0
total 85240
-rw-r--r--@  1 parson admin    35147 Apr 13 2016 COPYING
-rw-r--r--@  1 parson admin    16171 Apr 13 2016 README
-rw-r--r--@  1 parson admin  6621937 Apr 13 2016 WekaManual.pdf
drwxr-xr-x@ 57 parson admin     1938 Apr 13 2016 changelogs
drwxr-xr-x@ 27 parson admin      918 Apr 13 2016 data
drwxr-xr-x@ 17 parson admin      578 Apr 13 2016 doc
-rw-r--r--@  1 parson admin      510 Apr 13 2016 documentation.css
-rw-r--r--@  1 parson admin     1863 Apr 13 2016 documentation.html
-rw-r--r--@  1 parson admin    42900 Apr 13 2016 remoteExperimentServer.jar
-rw-r--r--@  1 parson admin 10759024 Apr 13 2016 weka-src.jar
-rw-r--r--@  1 parson admin    30414 Apr 13 2016 weka.gif
-rw-r--r--@  1 parson admin   359270 Apr 13 2016 weka.ico
-rw-r--r--@  1 parson admin 10997325 Apr 13 2016 weka.jar
-rw-r--r--@  1 parson admin 14758799 Apr 13 2016 wekaexamples.zip
Slides
on Minimum Description Length and Evaluating
Numeric Prediction,
for the week of November 6, in
preparation for Assignment 3.
Read Sections 5.8 and 5.9 in the 3rd
Edition textbook, 5.9 and 5.10 in the 4th Edition,
on Evaluating
Numeric Prediction and the Minimum Description
Length principle.
Read textbook sections
on linear regression & M5P model trees to reinforce
previous lecture material.
Here is an IBM
site overview of interpreting linear regression.
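To make the readings on evaluating numeric prediction concrete, here is a minimal sketch of four common error measures: mean absolute error, root mean squared error, and their relative versions. The function name is hypothetical. As an assumption, the baseline for the relative measures is the mean of the actual values passed in; Weka's reported figures are percentages and may use a different baseline.

```python
from __future__ import division, print_function
import math


def error_measures(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the trivial predictor that always answers the mean.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / base_abs,
        "RRSE": math.sqrt(sq_err / base_sq),
    }


if __name__ == "__main__":
    measures = error_measures([1, 2, 3, 4], [1, 2, 3, 8])
    for name in ("MAE", "RMSE", "RAE", "RRSE"):
        print(name, measures[name])
```

In the usage example a single large error gives an RMSE (2.0) twice the MAE (1.0), which shows how squaring penalizes outliers. A relative error at or above 1.0 means the model does no better than always predicting the mean.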
ZOOM ARCHIVES
Zoom
video (Aug 21 on Mac) on capturing weather data via
browser->Excel->CSV file, worth 20% of Assignment 1 (12.5
minutes).
Zoom
video of August 28 class, covering intro to the course.
PC-based how-to on capturing weather data starts about 1 hour
30 minutes in.
Zoom
video of September 4 class, covering slides for chapters 1
& 2, and most of assignment 1 handout / Python.
Zoom
video for first half of September 11 class, covering
makefile-driven testing and related issues.
Zoom
video for September 18 class, covering the analytics
certificate program and half of last year's assignment 2 / Weka
preprocessing.
Zoom
video for September 25 class, completing last year's
assignment 2 overview and surveying draft data for our
assignment 2.
Zoom
video for October 2 class, going over my solution to
assn1, assn2 handout, & textbook slides related to assn2,
kappa.
Zoom
video for October 9 class, starting last year's assn3 and
slides on linear regression, related error measures,
minimum
description length (MDL), and answering some questions during
assignment 2 work session near the end.
Zoom
video for October 16 class time, posted October 12, covers the
remainder of fall 2018's assignment 3 started on October 9,
linear regression, M5P model tree, Random Tree & Random Forest
applied to a numeric target attribute, 10-fold
cross validation versus separate test data sets, and other
material related to this fall 2019's upcoming assignment 3.
Zoom
video of October 13, view this after October 12's above. I
added some techniques to the Oct. 12 lesson that we may use in
assn3.
I
also updated this
handout from last fall, pages 13-18, to go with this
recording.
Zoom
video for October 23 class, my solution to assignment 2,
and new assignment 3, some kappa discussion.
Zoom
video for October 30 class, on Naive Bayes and Bayes Net
approaches to data analysis.
Zoom
video for November 6 class, wrapping up Bayes, discussing
supervised Discretize, and exploring examples of K-means
clustering.
Zoom
video for November 13 work session, Q&A on nested
ifelse() in AddExpression, and other assignment 4 topics.
Scripting and
Extension Languages as Career Levers for CS&IT Graduates.
November 21 Research & Teaching Presentation.
Here
is a Zoom recording for that November 21 talk.
Zoom
video for November 20 class, covering Assignment 3
solution, Assignment 4 handout, instance-based learning &
clustering.
Zoom
video for December 4 class, handout of Assignment 5, and a
little Q&A on Assignment 4.
Zoom
video for December 11 final exam class, answered a few
Assignment 5 questions & went over my Assignment 4 solution.