CSC 558 - Data Mining and Predictive Analytics II, Spring 2023, Wed 6-8:50 PM in Old Main 158.

IT outages planned for spring break

Dr. Dale E. Parson

Office Hours Monday 3-5 PM, Tuesday 3-4 PM, Friday (Zoom only) 3-5 PM, or by appt.
Monday & Tuesday office hours are either on Zoom using the link below or at Old Main 260.

Friday April 28 Zoom office hours will be from 12 to 2 PM because of the KTech picnic.

 
 Office hours Zoom: https://kutztown.zoom.us/j/94322223872

Link for CSC558 Zoom room (also in our D2L page's Content tab) and student instructions for using Zoom.
PLEASE FILL OUT & EMAIL ME THIS FORM THE FIRST WEEK OF CLASS.

First day handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine.
There are on-line copies of the Third Edition available in Rohrbach Library.
Our Graduate Data Analytics certificate program requires 4 courses, or 3 courses plus research or an internship.
    You need to register for the program, which is free, and you can use courses from a CSIT master's program.
    Talk to me if you are interested.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks!

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  If a student discloses such incidents to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to the Deputy to the President for Compliance, Equity & Legal Affairs:
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.

 
RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Link to the Fall 2022 CSC458 course.
Jan. 25 Weka overview using data from acad's ~parson/DataMine/csc570spring2023/month_aggregate_HMS_goodyears.arff.gz
    Data is a superset of that used in Fall 2022 CSC458 Assignment 3.
    See Section 3 of ongoing Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021.

Slides leading to Assignment on Ensemble Learning.
    Weka slides on evaluating numeric prediction.
    A summary of the Kappa Statistic.
    A subset of Chapter 5 on Evaluation and 7 on Data Transformations.
    Chapter 12 on Ensemble Learning.
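
As a small worked illustration of the Kappa statistic listed above, here is a minimal Python sketch that computes Cohen's kappa from a confusion matrix. The function name kappa_statistic and the example matrix are hypothetical and are not taken from the course slides; the formula is the standard (observed - expected) / (1 - expected) agreement measure.

```python
def kappa_statistic(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] = number of instances of actual class i
    that were predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)

    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total

    # Expected chance agreement: for each class, (row total * column total),
    # summed over classes and divided by total squared.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)

    # Undefined (division by zero) when expected == 1, for example when
    # every instance is both actual and predicted as one single class.
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical 2-class confusion matrix, made up for illustration.
    example = [[40, 10],
               [5, 45]]
    print(kappa_statistic(example))
```

A kappa near 0 means the classifier agrees with the true labels about as often as chance would predict from the class totals; a kappa of 1 means perfect agreement.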

Here is our textbook's website. We will be using version 3.8.5 of the Weka tool set, which you can download to your machine from here.
    You can run it on campus PCs from S:\ComputerScience\WEKA.
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    Additional Weka documentation is here.
    I will draw some material from this textbook as well.

        

We may use the Kaggle site for a project some time this semester.

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.


Background for 11/15 presentation on prepping a framework for CSC523 projects using Weka-like workflow.
    Code on acad / mcgonagall under ~parson/DataMine/TimeRaptors4CSC523

Bash Shell Scripting for Data Science lecture slides for first hour 11/29/2021


ASSIGNMENTS

To start Weka on campus PCs with S: mounted go to S:\ComputerScience\WEKA and click one of the batch files.


ASSIGNMENT 1: Review of Classification, Introduction to Instance Based (Lazy) Learning
PREP:
    Compilation of Weka slides on Instance Based Learning and Clustering
    Wissam Malke's thesis "Machine Listening with Very Small Training Datasets"
        Slides for his thesis
        Follow-up white paper "Mapping Data Visualization to Timbral Sonification and Machine Listening"
    Instance-Based Learning Algorithms, a paper from 1991.
    K*: An Instance-based Learner Using an Entropic Distance Measure, a paper from 1995.
    Locally Weighted Naive Bayes, a paper from 2012.
    A graph on informational entropy, which relates to building rules, decision trees, and K*.
    Kappa statistic summary
    Audio signal overview for Assignment 1 (reused with data modifications for fall 2021).

ASSIGNMENT 1 on instance-based learning is due via D2L Assessments -> Assignments by end of Feb. 25.
    Late penalty is 10% per day starting 9 AM after the deadline. This deadline was moved from Feb. 18 on 2/1/2023.
    If your training and testing files have attributes ordered differently or tid is missing from training data you may see this pop-up:
    [Screenshot: attribute mapping pop-up]
Just click Yes and you'll see the attribute mapping like this when you run the classifier.
    Here are the AddExpression derived attribute distributions for checking your AddExpression results.
    Here is my Python script for deriving an alternate representation of these data.

Post-mortem on Assignment 1 out-of-range funfreq for some PulseOsc:
    Comparison of lowest and highest tfreq tags by tosc, max funfreq for PulseOsc.
    Max funfreq time domain and frequency domain graphs for tid 234361.
    Frequency domain bins with my debugging hack for 234361 extracted by my ChucK program (Chuck Reference).
    ChucK program that generated 234361 WAV file.
    Python program for converting above ChucK output to CSV.
    Non-ChucK Python program that uses numpy and scipy to extract frequency bins from WAV files.
        Bin analysis debug printing for tid 234361.
        ARFF file that does not chop off frequencies below 100 Hz (no high-pass filter).
        ARFF file that does chop off frequencies below 100 Hz (a high-pass filter)
    Extending above ChucK program from 512 to 32,768 FFT bins fixes the aliasing problem > 2000 Hz.
        See ~parson/DataMine/ on acad. We will go over this fix March 8.
            PulseUna_234361_512.txt
            PulseUna_234361_22050.txt
            PulseUna_234361_22050_examine.txt
    See new application of FFTs to deep learning here and here.
    Pivoted ARFF file, March 8, shows step, frequency, and amplitude for the 32,768-bin ChucK FFT extraction.
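
To make the effect of FFT size in the post-mortem above more concrete, here is a minimal numpy sketch of how the spacing between frequency bins shrinks as the FFT grows from 512 to 32,768 points. It is not the ChucK program or my Python extraction script. The 44,100 Hz sample rate, the 2500 Hz test tone, and the function names are assumptions for illustration, and the sketch treats the 512 and 32,768 figures as FFT sizes.

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed sample rate in Hz; not stated on this page


def bin_width_hz(fft_size, sample_rate=SAMPLE_RATE):
    """Frequency spacing in Hz between adjacent FFT bins."""
    return sample_rate / fft_size


def peak_frequency(signal, fft_size, sample_rate=SAMPLE_RATE):
    """Center frequency of the strongest bin in the first FFT frame."""
    frame = signal[:fft_size]
    magnitudes = np.abs(np.fft.rfft(frame, n=fft_size))
    freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
    return freqs[np.argmax(magnitudes)]


if __name__ == "__main__":
    # Hypothetical 2500 Hz sine tone, long enough for the larger FFT.
    t = np.arange(32768) / SAMPLE_RATE
    tone = np.sin(2 * np.pi * 2500.0 * t)
    for fft_size in (512, 32768):
        print(fft_size, bin_width_hz(fft_size), peak_frequency(tone, fft_size))
```

Under these assumptions each bin of the 512-point FFT spans about 86 Hz, while each bin of the 32,768-point FFT spans about 1.3 Hz, so the larger FFT can place a partial's frequency much more precisely.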


ASSIGNMENT 2 on regression & ensemble models is due by
    11:59 PM on Monday March 13, 2023 via D2L Assignment 2.
    Here is the README.txt that includes the questions.
    Potential pitfall: Q7: Re-Normalize if necessary to get Normalized non-target attributes.
        Continue using Normalized nontarget attributes unless otherwise instructed (at Q17).
    The standard 10% per day deduction for late assignments applies.
    The 13th is the start of spring break, so no office hours that day.
    The preceding weeks’ office hours are at usual times & modalities.

ASSIGNMENT 3 on time series regression is due by the end of Sunday April 9.

ASSIGNMENTS 4 & 5 will be student mini-research projects.

Assignment 5 grading criteria are here:
https://faculty.kutztown.edu/parson/fall2021/CSC558FallFinal2021.html#Assignment_5
Please make sure to turn in slides, data or data dictionary / schema,
    and Sections 4 & 5 of your PDF via D2L by 11:59 PM Wednesday, the night of the talks.
Project 5 grading rubrics:
    The 5.3 presentation counts for 20% of this project. Make it clear. Be ready to answer questions.
        If in the classroom, repeat the questions for Zoom students before answering them.
    The remaining 80% distribute as 10 points each for 5.1 and 5.2.a through 5.2.g.

 1 18:00 Reagan Newswanger   Super Bowl advertisement data
 2 18:15 Olivia Weber        Proprietary dataset Direct Operating Profit correlations
 3 18:30 Pei Hua Lin         Mobile strategy games from Kaggle
 4 18:45 Yelitza Pagan       PASSHE employee pay correlations
 5 19:00 Jack McNally        Student survey academic performance analysis (Kaggle)
 6 19:15 Ryan Quinn          Probability of voting per other attributes
   19:30                     BREAK
 7 19:45 Nicholas Morello    Song attribute analysis from Kaggle
 8 20:00 Donna DeMarco       NSF-funded scholarship analysis
 9 20:15 Abby Komlenic       Academic fund raising amounts correlated with other donor properties
10 20:30 Edwin Cadiz         Patient experience/reviews using Abilify to treat mental health conditions
11 20:45 Nik Golombek        Fuel Efficiency
12 21:00 Nathan Rew          USGS stream data differentials in Limestone hi vs. lo

Here is the link to Fall 2021 Assignments 4 & 5. We are using these specs.
    Here is the topic sheet for spring 2023.
    You can turn in a data dictionary like this if working with proprietary data. It needn't be that long. Obfuscating such data is OK.
  
 We will use the same requirements as fall 2021.
    Assignment 4 is due Wednesday April 26 and Assignment 5 is due Wednesday May 10 via D2L.
        Student presentations will be during the Final Exam period at class time on Wednesday May 10.
        Additional sites beyond what is available on the fall 2021 page:
        https://waterdata.usgs.gov/nwis/rt Use across-time for a site or across-site for a time period.
            We have analyzed dissolved oxygen levels as a function of temperature, day of year, time of day.
            You could analyze changes in properties as a function of year or year,month for a single site.
            You could analyze differences in attributes as a function of different sites in the same time period.
        https://data.noaa.gov/datasetsearch/ and https://www.ncei.noaa.gov/cdo-web/ for climate data.
            You could look for global stilling (reduction in wind speed differences) for two sites N-to-S.
        Kaggle is a popular data source. Make your own analysis or variation.
        Johns Hopkins Coronavirus data & maybe others until March 2023.
    Liquid Interactive is sponsoring 3 prizes ($1000, $500, and $250) for the top three projects in 2023.
        Connor Ellis' presentation and short paper from fall 2021.
        Kelly Fox's presentation and short paper from spring 2020.
        Tyler Stoney's presentation and short paper from spring 2018.

ZOOM VIDEO ARCHIVE

Jan 25 class: First day handout, overview / review of some CSC458 projects & related, introduction / review of Weka.
Feb 1 class: Went over instance-based learning slides & a Weka lazy/instance based demo using PA water data.
    Assignment 1 due date slipped to 2/25. We will go over it & have a work session on 2/8.
Feb 8 class: Went over Assignment 1 handout + short discussion of Bayes and conditional probability.
Feb 15 class: Went over slides on clustering & demoed Weka K-means clusters using Assignment 1 data.
    Assignment 1 demo of AddExpression + Q&A in last part of class.
Feb 22 class: Went over slides on Evaluating Regression and Ensemble Models, then Assignment 2 up to Discretize.
Mar 1 class: Post-mortem on Assignment 1 above, including alternate data preprocessing, finished Assignment 2 from Discretize until Q13.
Mar 8 class: Association Rules & Clustering. Preceded by some Assignment 2 Q&A for first 20ish minutes.
    I missed hitting Zoom RECORD after 8 PM so I will make those topics clear in Assignment 3.
Mar 22 class: End of Assignment 2 and went over Assignment 3 handout & related Hawk Mountain background.
Mar 29 class: Graduate Certificate discussion, table-driven Python scripting for CSC523, work session.
    Assignment 3 handout updated with Q&A from last night's work session.
Apr 5 class: Went over Python code to run Weka command line for all Assignment 3 variants.
Apr 12 class: Went over solutions to Assignment 3 & requirements for Assignments 4 & 5.
Apr 19 class: Script for fixing CSV files with some rows of data that are too big or small, bash shell commands for manipulating text data.
Apr 26 class: Some Q&A on Assignments 4 & 5 and my extension Python script for the Assignment 3 RAPTOR_td and RAPTOR_td_log10 target attributes.
    Assignment 5 due date is bumped to May 10 11:59 PM, the night of the 15-minute presentations.
May 3 class: Some reminders about Assignment 5, then short discussion of my Assignment 3 work in progress, then work time Q&A.
May 10 class: Student presentations of Assignments 4 & 5.