CSC 458 - Data Mining & Predictive Analytics I, Spring 2021, TuTh 4:30-5:50 PM.
Classes are all via Zoom at class time. Zoom student docs are here.
To watch a recording you *may* first need to go here https://kutztown.zoom.us/ and Sign In using KU login.


FINAL EXAM TIME
Thursday, May 6, 2 p.m. – 4 p.m. Q&A on Assignment 5, no exam.
My office hours will not change during finals, same as usual.

Dr. Dale E. Parson, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458 OR See D2L Course CSC458 -> Content -> Overview for the link.
IF you don’t want to be recorded or are a minor, use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.
The course is 100% via Zoom at class time. I will record & post class videos, but want you there at class time. Thanks.
 
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 2-4, Wednesday 1-3, Thursday 10-11 or by appt.

First day handout (syllabus that is specific to this semester).

KU Campus Mask policy: Resident students must wear a mask anytime they are outside of their personal room and within a building or with anyone else but their roommate. Commuter students must wear a mask anytime they are on campus within a building or with anyone.

PA: The Secretary's Order requires individuals to wear a face covering, in both indoor public places and in the outdoors when they are not able to consistently maintain social distancing from individuals who are not members of their household, such as on a busy sidewalk, waiting in line to enter a place, or near others at any place people are congregating. Whether inside in a public place or outside, and when wearing a face covering or not, everyone should socially distance at least 6 feet apart from others who are not part of your household.

HANDOUTS


Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.

Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store; either edition is fine. I have put a copy of the 3rd Edition of the textbook on reserve in Rohrbach Library. You can go to the front desk & borrow it overnight.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks! Here is a poll to which you can reply privately on paper or via email.

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% late penalty for projects that come in after the due date. There will be a 10% deduction from a homework assignment for repeated web surfing, web-based chatting or other use of the Internet for activities unrelated to class activities during both lectures and working sessions. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.
 
RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Here are textbook slides from Kotu’s & Deshpande’s Predictive Analytics and Data Mining: Concepts and Practice using RapidMiner.
    I found this book at the start of the semester and will probably use it the next time I teach the course.
    It is at a more appropriate level for the course, but the slides from all 3 textbooks cited on this page stink.
    However, I will use the slides from this book, with my own additions, since they are generally better.
    I adapted slides for Linear Regression and M5P Trees, 10/30/2017.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using the Weka tool set, which you can download to your machine from here.
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    I will draw some material from this textbook as well.

The Weka download page has this note:
If your computer has a display that has a high pixel density, and you are using Windows, Weka's user interfaces may not be scaled appropriately and appear tiny. Installing Java 9 or later solves this problem. Alternatively, in the Program menu of Weka's GUIChooser, go into Settings, and select WindowsLookAndFeel from the "Look and feel for UI" dropdown menu. Some Weka packages currently do not work (properly) with Java 9 or later (tigerJython and scatterPlot3D).

PYTHON.
    How to Think Like a Computer Scientist looks like a good tutorial for Python 3.x newbies.
    Dr. Schwesinger has posted some additional Python textbooks for CSC223.
    We will be using the 3.x version of Python.
    Try running python -V to see that you are getting Python 3.x.x as your default.
        From the mcgonagall machine (ssh mcgonagall from acad) do the following actions in bold:
        Edit a file called .bash_profile in your login directory (create it if needed) and add these 3 lines near the top.
                export PATH="/usr/local/bin:${PATH}"
                alias python="/usr/local/bin/python3.7"
                alias ipython="/usr/local/bin/ipython3"
        Save the file and exit the editor, log out and log back into mcgonagall.
        Now type this:
                python -V    # You should see this:
                    Python 3.7.7
            If you install python on your own machine, just running python will get you the simpler-to-use interpreter.
            I will use ipython in lecture.
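The version check above can also be done from inside the interpreter. This is a minimal sketch; the helper name is_python3 is made up for illustration and is not part of any course handout.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3

if __name__ == "__main__":
    # Prints the same information as `python -V`, in the form "Python X.Y.Z".
    print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
    if not is_python3():
        print("Warning: this is not Python 3.x; check the aliases in .bash_profile.")
```

If the warning prints on mcgonagall, the alias lines were probably not picked up; logging out and back in is the first thing to try.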

    The Python website is at http://www.python.org/.
    There is a good on-line tutorial and reference by Steven F. Lott called Building Skills in Python. There is a PDF copy here.
    The IPython site is here.
We will be using Python for data preparation in assignment 1.
    We have Python installed on acad, but if you want your own copy:
    You can download Python 3.x from here. Use the most recent stable 3.x for this course.
    Documentation including tutorials for the 3.x library is here.

Here are my introductory slides on Python. We will explore Python in class.
    Using Notepad++: to insert spaces instead of tabs, go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    and check Replace by space.
    To convert existing tabs to spaces, press Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta

RESOURCES

The pythex utility for testing Python regular expressions

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Textbook slides
        Chapter 1 (week 1 - overview)
        Chapter 2 (week 1 - input)
        Chapter 3 (week 3 - output)
        Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)
            A graph on informational entropy, relates to building rules & decision trees.
            A page describing Bayes theorem and related matters.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
            BayesNet examples from the textbook.
        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
            A draft discussion of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) applied to nominal classification.
        Chapter 8 (week 6 - data transformations)
        Chapter 12 on Ensemble Learning for Spring 2021
            We will use data from finalexam458fall2018.problem.zip at ~parson/DataMine to demo ensemble learning on 4/13.
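The informational entropy mentioned under Chapter 4 above can be computed in a few lines. This is a rough sketch rather than course code; the function names entropy and information_gain are chosen here for illustration. A decision tree learner prefers the attribute whose split gives the largest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, groups):
    """Entropy of `labels` minus the weighted entropy of the `groups`
    that result from splitting the same instances on some attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# 9 "yes" and 5 "no" instances give about 0.940 bits of entropy.
play = ["yes"] * 9 + ["no"] * 5
print(round(entropy(play), 3))

# A split that separates the classes perfectly recovers all of the entropy.
mixed = ["yes", "yes", "no", "no"]
print(information_gain(mixed, [["yes", "yes"], ["no", "no"]]))
```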

I will draw some material from this textbook as well.
        Chapter 1 (overview)
        Chapter 2 (overview)
        Chapter 3A (data exploration)
        Chapter 3B (data exploration)
        Chapter 4A (information-based learning)
        Chapter 6A (probability-based learning)
        Chapter 6B (probability-based learning)
        Chapter 8A (evaluation)
        Chapter 8B (evaluation)
        Appendix A (descriptive statistics & data visualization)
        Appendix B (introduction to probability)  
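For the probability-based learning material, here is a small sketch of Bayes' theorem on a 52-card deck. It is NOT the BayesCards.py script on acad, whose contents are not shown on this page; the names DECK, prob, is_king and is_face are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))   # 52 (rank, suit) pairs

def prob(event, given=None):
    """P(event | given), found by counting cards in a uniformly shuffled deck."""
    pool = [card for card in DECK if given is None or given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in ("J", "Q", "K")

# Bayes' theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
bayes = prob(is_face, given=is_king) * prob(is_king) / prob(is_face)
# The same conditional probability, counted directly.
direct = prob(is_king, given=is_face)
print(bayes, direct)
```

Using Fraction keeps the arithmetic exact, so the two routes to the answer can be compared with ==.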

ASSIGNMENTS

Assignment 1
    From acad you must "ssh mcgonagall" to test Assignment 1.
    Assignment 1 uses Python regular expressions to parse a textual data file and write it to a comma-separated value file.
        It is due by 11:59 PM on Saturday February 13 via make turnitin.
    https://pythex.org/ is a very valuable interactive tool and https://docs.python.org/3/library/re.html is library documentation.
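As a warm-up for this kind of task, here is a hedged sketch of parsing text lines with a regular expression and writing them out as CSV. The line format, the pattern LINE_RE, and the function text_to_csv are invented for illustration; they are NOT the Assignment 1 data format or solution.

```python
import csv
import io
import re

# Hypothetical input line format (an assumption for this example):
#   "2021-02-13 sensor=A7 reading=12.5"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"sensor=(?P<sensor>\w+)\s+"
    r"reading=(?P<reading>-?\d+(?:\.\d+)?)\s*$"
)

def text_to_csv(lines, outfile):
    """Write one CSV row per matching line; return the count of skipped lines."""
    writer = csv.writer(outfile)
    writer.writerow(["date", "sensor", "reading"])   # header row
    skipped = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            skipped += 1          # line does not fit the pattern
            continue
        writer.writerow([match.group("date"),
                         match.group("sensor"),
                         match.group("reading")])
    return skipped

sample = [
    "2021-02-13 sensor=A7 reading=12.5",
    "this line is noise",
    "2021-02-14 sensor=B2 reading=-3",
]
buffer = io.StringIO()            # stands in for an open output file
num_skipped = text_to_csv(sample, buffer)
print(buffer.getvalue())
```

Named groups such as (?P<date>...) make the pattern easier to test piece by piece in pythex before putting it in a script.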

Assignment 2 is due via D2L Assignment 2 web page by 11:59 PM on Saturday March 13.
    Here is fall 2019's Assignment 2 with answers as a sample for the Feb. 16 & 18 classes, and my page on interpreting the Kappa statistic.
    The fall 2019 Assignment 3 solution Figs. 1-3 show a custom step function for discretizing the BW counts.
    Here is the handout JoinedHawkMtn20172018.arff file & the student-edited FilteredCSC458assn2.arff from that project.
    Download Weka 3.8.5 (the latest stable version) to work on your own machine. There is a Windows copy on the campus network.
    Here is my page on interpreting the Kappa statistic needed for Assignment 2.
    Here are my answers to Assignment 2.
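For readers who want to check a Kappa value by hand, here is a minimal sketch of Cohen's kappa computed from a confusion matrix. It assumes rows are actual classes and columns are predicted classes; the function name kappa and the example matrix are made up for illustration and do not come from Assignment 2.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    # Note: undefined (division by zero) when expected agreement is 1.0.
    return (observed - expected) / (1.0 - expected)

# 85% accuracy where chance agreement is 50% gives kappa of about 0.7.
print(kappa([[40, 10], [5, 45]]))
```

A kappa of 0 means the classifier agrees with the actual classes no better than chance, and 1 means perfect agreement.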

Assignment 3 is due via D2L Assignment 3 web page by 11:59 PM on Friday April 2.
    Here are my answers to Assignment 3.

Assignment 4 is due via D2L Assignment 4 by 11:59 PM on Friday April 23.
    Here are my answers to Assignment 4.

Assignment 5 is due via D2L Assignment 5 by 11:59 PM on Sunday May 9.
    Chapter 12 on Ensemble Learning for Spring 2021 in PDF format for reference on ensemble classifiers.
        See Bagging & Boosting pseudocode on pages 7 & 13 of these slides.
    I will not accept assignments after 9 AM Monday May 10. Please do not procrastinate.
    This is in place of a final exam. I will not answer questions except in class & the final exam period.
    I will not answer technical questions. We have already covered the technical concepts in this assignment.
    I will only clarify confusing questions & fix ambiguous text.
    Our final exam period is Thursday, May 6, 2 p.m. – 4 p.m.
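For reference on the general idea behind bagging, here is a small generic sketch: draw bootstrap samples with replacement, train one base model per sample, and combine predictions by majority vote. It is NOT the pseudocode from the slides and is not an Assignment 5 answer; the one-attribute threshold learner train_stump and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-attribute threshold rule to a list of (x, label) pairs."""
    labels = sorted({label for _, label in data})
    best = None
    for threshold in sorted({x for x, _ in data}):
        for low in labels:
            for high in labels:
                errors = sum(1 for x, label in data
                             if (low if x <= threshold else high) != label)
                if best is None or errors < best[0]:
                    best = (errors, threshold, low, high)
    _, best_t, best_low, best_high = best
    return lambda x: best_low if x <= best_t else best_high

def bagging(data, n_models=11, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

toy = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
predict = bagging(toy)
print(predict(0), predict(10))
```

Each stump sees a slightly different sample, so individual models can disagree; the vote smooths out that variance.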

To increase Weka's memory on a Mac, run it from the command line; do not use the .app at all:
ku135515parson:~ parson$ alias
alias weka='java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar'
alias wekanew='java -server -Xmx4000M -jar /Applications/weka-3-8-2/weka.jar'
ku135515parson:~ parson$ ls -ld /Applications/weka*
drwxr-xr-x@ 16 parson  admin  544 Apr 13  2016 /Applications/weka-3-8-0
drwxr-xr-x@  3 parson  admin  102 Apr 13  2016 /Applications/weka-3-8-0-oracle-jvm.app
drwxr-xr-x@ 16 parson  admin  544 Dec 21  2017 /Applications/weka-3-8-2
drwxr-xr-x@  3 parson  admin  102 Dec 21  2017 /Applications/weka-3-8-2-oracle-jvm.app
ku135515parson:~ parson$ ls -l /Applications/weka-3-8-0
total 85240
-rw-r--r--@  1 parson  admin     35147 Apr 13  2016 COPYING
-rw-r--r--@  1 parson  admin     16171 Apr 13  2016 README
-rw-r--r--@  1 parson  admin   6621937 Apr 13  2016 WekaManual.pdf
drwxr-xr-x@ 57 parson  admin      1938 Apr 13  2016 changelogs
drwxr-xr-x@ 27 parson  admin       918 Apr 13  2016 data
drwxr-xr-x@ 17 parson  admin       578 Apr 13  2016 doc
-rw-r--r--@  1 parson  admin       510 Apr 13  2016 documentation.css
-rw-r--r--@  1 parson  admin      1863 Apr 13  2016 documentation.html
-rw-r--r--@  1 parson  admin     42900 Apr 13  2016 remoteExperimentServer.jar
-rw-r--r--@  1 parson  admin  10759024 Apr 13  2016 weka-src.jar
-rw-r--r--@  1 parson  admin     30414 Apr 13  2016 weka.gif
-rw-r--r--@  1 parson  admin    359270 Apr 13  2016 weka.ico
-rw-r--r--@  1 parson  admin  10997325 Apr 13  2016 weka.jar
-rw-r--r--@  1 parson  admin  14758799 Apr 13  2016 wekaexamples.zip

Slides on Minimum Description Length and Evaluating Numeric Prediction,
    for the week of November 6, in preparation for Assignment 3.
    Read Sections 5.8 and 5.9 in the 3rd Edition textbook, 5.9 and 5.10 in the 4th Edition,
        on Evaluating Numeric Prediction and the Minimum Description Length principle.
        Read textbook sections on linear regression & M5P model trees to reinforce previous lecture material.
        Here is an IBM site overview of interpreting linear regression.
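The usual measures for evaluating numeric prediction can be sketched in plain Python as follows. This is an illustrative sketch using the standard definitions of mean absolute error, root mean squared error and the (Pearson) correlation coefficient; the function names and the sample numbers are made up here.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared prediction error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 5.0]
print(mean_absolute_error(actual, predicted))       # 0.5
print(root_mean_squared_error(actual, predicted))
print(correlation_coefficient(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.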

ZOOM ARCHIVES
        January 19, Intro to course and a survey of past data mining projects at KU.
        January 21, Went through Chapter 1 slides on overview and surveyed projects done at KU.
        January 26, Went over basic Python if-while-for control constructs and container types. Assn1 TBD Thursday Jan 28.
        January 28, Went over Assignment 1 and use of https://pythex.org/ for interactive regular expression debugging.
        February 1, Office hours detailed steps for using mcgonagall & completing all requirements in Assignment 1.
        February 2, Q&A for Assignment 1, will start new material February 4.
        February 4, Went over slides for Chapter 2 and related this back to past data analysis projects at KU.
        February 9, Started Chapter 3 slides through Decision Trees, some Weka demo.
            CARESTEM458B.arff.txt is the Weka data file demoed in class. Save it as CARESTEM458B.arff for Weka use.
        February 16 (link removed) Assignment 1 debriefing, lead up to fall 2019 Assignment 2 walk through, will continue 2/18.
        February 18, Walked through start of Fall 2019 Assignment 2 up to Discretize w. UseEqualFrequency on page 9.
        February 23, Finished walk through of Fall 2019 Assignment 2, examined kappa statistic.
        February 25, Went over Assignment 2 handout above.
        March 2, Weka nominal classification for this analysis and this analysis of programming student behavior-to-grade.
        March 4, Weka numeric classification demo of March 2 dataset focusing on Linear Regression & M5P model trees.
        March 9, Demoed pros & cons of normalization of attributes into the range [0.0, 1.0] for regression,
            went over correlation coefficient and other evaluation metrics including kappa, last 45 minutes were assn2 Q&A.
        March 11, Personal day for me, no class, please watch July 2020 video of simulation COVID@KU for data analysis.
        March 16, Went over my 4 solutions to Assignment 2 & expanded on discretization and over-fitting discussions.
        March 18, Went over Assignment 3 handout.
        March 23, Went over discretization and other preprocessing, filtering, & derived attributes, Chapter 8 slides.
        March 25, Went over basic statistics, unconditional and conditional (Bayesian) probabilities.
            My BayesCards.py script from acad ~parson/DataMine/BayesCards.py. A screen shot from the Zoom recording.
        March 30, Went over Naive Bayes examples from this paper & interactive Weka and my BayesCards.py script.
        April 1, Work Q&A session not recorded.
        April 6, Went over Parson solution to Assignment 3, handed out Assignment 4.
        April 8, Went over new Assignment 4 and looked at Naive Bayes & Bayes Net tables.
        April 13, Started Chapter 12 slides on Ensemble Learning including Bagging & RandomForest demos in Weka.
        April 15, Finished discussion of Ensemble Learning in preparation for Assignment 5.
        April 22, Bryan McNally presentation on Hawk Mtn climate-flight height correlation, partial overview of Assn5.
        April 27, Parson's solution to Assignment 4, start Assignment 5.
        April 29, Completed going over handouts & Q&A for Assignment 5.
        May 6, Some Q&A about Assignment 5 from our final "exam" class.
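
The normalization of attributes into the range [0.0, 1.0] discussed in the March 9 session is ordinary min-max scaling. Here is a minimal sketch; the function name normalize and the choice to map a constant attribute to 0.0 are assumptions made for this example.

```python
def normalize(values):
    """Rescale numeric values linearly so the minimum maps to 0.0
    and the maximum maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map everything to 0.0 (assumed choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([10, 20, 30]))   # [0.0, 0.5, 1.0]
```

Because the minimum and maximum define the scale, a single outlier can squeeze the remaining values into a narrow band, which is one of the drawbacks to weigh against the benefits.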