CSC 558 - Data Mining and Predictive Analytics II, Fall 2021, Monday 6-8:50 PM in Old Main 158.

Dr. Dale E. Parson

Office Hours Monday 11/15 only are 12-1 and 3-4 PM to accommodate grad school fair.
Office Hours Wednesday 11/17 only are 2-4 PM to accommodate evening student job interviews.

Fall 2021 Office Hours: Monday 2-4, Wednesday 4-6 (Zoom only), Thursday 10-11 or by appt.
We will distance 6 feet in my office, so plan to attend office hours online.
Office hours Zoom:
Links for CSC558 Zoom room (also in our D2L page's Content tab) and student instructions for using Zoom.

First day handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine.
There are on-line copies of the Third Edition available in Rohrbach Library.
There are 4 courses or 3 courses + (research or internship) in our Graduate Data Analytics certificate program.
    You need to register for the program (registration is free), and you can use courses from a CSIT master's program.
    Talk to me if you are interested.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks!

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700

There is a 10% per day late penalty for projects that come in after the due date. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.

RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Link to the Spring 2021 CSC458 prerequisite course.

Slides leading to Assignment on Ensemble Learning.
    Fall 2017 CSC458 slides on evaluating numeric prediction.
    A summary of the Kappa Statistic.
    A subset of Chapter 5 on Evaluation and 7 on Data Transformations.
    Chapter 12 on Ensemble Learning.
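
As a companion to the Kappa Statistic summary listed above, here is a minimal Python sketch of the standard Cohen's kappa computation from a confusion matrix. The function name and the two example matrices are hypothetical illustrations, not taken from the course materials.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class example: 85% accuracy on balanced classes.
balanced = [[40, 10],
            [5, 45]]
# Hypothetical majority-class guesser: 90% accuracy but no skill beyond chance.
majority_only = [[90, 0],
                 [10, 0]]

print(cohen_kappa(balanced))
print(cohen_kappa(majority_only))
```

The second matrix shows why kappa is reported alongside percent correct: a classifier that always predicts the majority class can score high accuracy while its kappa stays at zero. The sketch does not guard against the degenerate case where chance agreement equals 1.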

Here are textbook slides from Kotu’s & Deshpande’s Predictive Analytics and Data Mining: Concepts and Practice using RapidMiner.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using version 3.8.5 of the Weka tool set, which you can download to your machine from here.
    Here is where you can run it on campus PCs (S:\ComputerScience\WEKA):
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    Additional Weka documentation is here.
    I will draw some material from this textbook as well.


We may use the Kaggle site for a project some time this semester.

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Background for 11/15 presentation on prepping a framework for CSC523 projects using Weka-like workflow.
    Code on acad / mcgonagall under ~parson/DataMine/TimeRaptors4CSC523

Bash Shell Scripting for Data Science lecture slides for first hour 11/29/2021


To start Weka on campus PCs with S: mounted go to S:\ComputerScience\WEKA and click one of the batch files.

ASSIGNMENT 0: Stay Safe & Protect Your Neighbors in Fall 2021
    "Simulated Contact Tracing of COVID-19 Propagation at Kutztown University for Fall 2020" best PACISE 2021 faculty paper
        Slides for the updated talk. Fall 2021 update. Python R0 calculator.
            Simulation code at S:\ComputerScience\Parson\Processing3\CovidUFall2021V1MIDI
    KU Mask Adherence Process

ASSIGNMENT 1: Review of Classification, Introduction to Instance Based (Lazy) Learning
    Compilation of Weka slides on Instance Based Learning and Clustering
    Wissam Malke's thesis "Machine Listening with Very Small Training Datasets"
        Slides for his thesis
        Follow-up white paper "Mapping Data Visualization to Timbral Sonification and Machine Listening"
    Instance-Based Learning Algorithms, a paper from 1991.
    K*: An Instance-based Learner Using an Entropic Distance Measure, a paper from 1995.
    Locally Weighted Naive Bayes, a paper from 2012.
    A graph on informational entropy, relates to building rules, decision trees, and K*.
    Kappa statistic summary
    Audio signal overview for Assignment 1 (reused with data modifications for fall 2021).
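
The entropy graph listed above relates to how rules and decision trees choose attributes. The following is a minimal Python sketch of Shannon entropy and information gain for nominal data. The function names and the tiny dataset are hypothetical and are not drawn from the Assignment 1 audio data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attr_index, class_index=-1):
    """Reduction in class entropy from splitting rows on one nominal attribute."""
    labels = [row[class_index] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[class_index])
    # Weighted average entropy of the partitions after the split.
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: (outlook, temperature, play).
toy_rows = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "hot", "yes"),
    ("rainy", "mild", "yes"),
]

print(entropy([row[-1] for row in toy_rows]))   # class entropy before any split
print(information_gain(toy_rows, 0))            # gain from splitting on outlook
print(information_gain(toy_rows, 1))            # gain from splitting on temperature
```

In this toy data the first attribute separates the classes completely and the second carries no information about the class, which is the contrast a tree learner exploits when it picks a split.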

ASSIGNMENT 1 on instance-based learning is due via D2L Assessments -> Assignments by end of 9/30.
    Late penalty is 10% per day starting 9 AM after the deadline.
    If your training and testing files have attributes ordered differently or tid is missing from training data you may see this pop-up:
    Just click Yes and you'll see the attribute mapping like this when you run the classifier.
    Here is some related Python code in a zip file for discussion in the September 20 class.
    Here are the AddExpression derived attribute distributions from email to class on 9/26.
    Here is my solution posted from class on October 4 (temporarily removed).
    Here is my Python script for deriving an alternate representation of these data.
        ARFF files are under S:\ComputerScience\parson\Weka\audioExample on the KU PC network.

ASSIGNMENT 2 will be on applying ensemble learning & previous techniques.
        It is due by end of Thursday October 21 via D2L Assignment 2.
        10% penalty per day late without a medical excuse.
    Slides leading to Assignment on Ensemble Learning.
    CSC458 slides on evaluating numeric prediction.
    A summary of the Kappa Statistic.
    A subset of Chapter 5 on Evaluation and 7 on Data Transformations.
    Chapter 12 on Ensemble Learning.    
    Here is my solution.
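
For the slides on evaluating numeric prediction listed above, here is a minimal Python sketch of three commonly reported measures: mean absolute error, root mean squared error, and the Pearson correlation coefficient. These are the textbook definitions written out by hand. The function names and sample values are hypothetical, and Weka's own output may differ in detail.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average magnitude of the prediction errors."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalizes large errors more."""
    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical predictions that are consistently one unit too high.
actual_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [2.0, 3.0, 4.0, 5.0]

print(mean_absolute_error(actual_values, predicted_values))
print(root_mean_squared_error(actual_values, predicted_values))
print(correlation_coefficient(actual_values, predicted_values))
```

The example illustrates why correlation should not be read alone: a constant offset leaves the correlation perfect while the error measures still register the miss.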

ASSIGNMENT 3 on analyzing time-series data is due by 11:59 PM November 11 via D2L Assignment 3.
    Time series PPT slides for October 18.
    Kobayashi Maru relates to Q12's bonus points and winning a no-win situation by redefining the rules of engagement.
        After two students stumbled onto this approach, I incorporated it into Q14 after exploring clustering of Q15.
ASSIGNMENTS 4 & 5 are individual student mini-research projects.
    ASSIGNMENT 4 is due via D2L on Friday November 26.
    ASSIGNMENT 5 materials are due via D2L on Sunday December 5, with presentations on the 6th & 13th.
    Liquid Interactive will make an award to the best overall project as described in the linked handout.


August 30 First day handout, semester overview, and walk-through of ASSIGNMENT 0 data-driven COVID@KU simulations.
September 13 Went over instance-based (lazy) machine learning & started Assignment 1.
    Zoom glitches due to Live Transcript. I will disable that in the future.
September 20 Went over various perspectives on time-periodic datasets and their frequency domain normalization & analysis.
    Here is some related Python code in a zip file for discussion in the September 20 class.
September 27 Assignment 1 pointers, slides on MDL and evaluating numeric prediction, Assignment 1 Q&A.
October 4 (temporarily removed) Went over solution to Assignment 1 including alternative FFT encoding, then ensemble learning and Assignment 2.
October 18 Went over slides on Time Series and then work / Q&A for Assignment 2.
October 25 Went over my solutions with added data encodings for Assignment 2 and then got up to Q13 on new Assignment 3.
November 1 Some Assignment 3 Q&A, clustering on Q13-Q15, overview of individual Assignments 4 & 5.
November 8 Went over Python scripts for batch automating Weka runs then Q&A on Assignment 3.
    Scripts on acad /home/ in files
November 23 (no video for November 15 Assignment 3 solution because one student's still out) ...
    Using OneR iteratively to find most important attributes;
    LinearRegression / M5P / Normalize guidelines with reference to Assignment 2.
November 29 Went over Bash Shell Scripting for Data Science slides then Q&A on Assignment 5.
December 6 First group of final student presentations.
December 13 Second group of final student presentations.
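
The November 23 session mentions using OneR iteratively to find the most important attributes. The sketch below is one plausible reading of that idea, not the procedure shown in class: run a hand-written 1R on nominal attributes, record the winning attribute, remove it, and repeat. The function names and toy data are hypothetical, and this is not Weka's OneR implementation (which also handles numeric attributes).

```python
from collections import Counter, defaultdict


def one_r(rows, attr_indices, class_index=-1):
    """Return (attribute index, rule, training errors) for the best single attribute.

    The rule maps each attribute value to its majority class.
    """
    best = None
    for attr in attr_indices:
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[class_index]] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        # Errors are the instances not in the majority class of their value.
        errors = sum(
            sum(counts.values()) - counts.most_common(1)[0][1]
            for counts in by_value.values()
        )
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


def rank_attributes(rows, attr_indices, class_index=-1):
    """Rank attributes by repeatedly picking the 1R winner and removing it."""
    remaining = list(attr_indices)
    ranking = []
    while remaining:
        attr, _rule, errors = one_r(rows, remaining, class_index)
        ranking.append((attr, errors))
        remaining.remove(attr)
    return ranking


# Hypothetical toy data: (outlook, temperature, play).
sample_rows = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "hot", "yes"),
    ("rainy", "mild", "yes"),
    ("overcast", "hot", "yes"),
    ("overcast", "mild", "no"),
]

print(rank_attributes(sample_rows, [0, 1]))
```

Each entry in the ranking pairs an attribute index with its 1R training error count, so attributes that appear earlier with fewer errors are the stronger single predictors on this data.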