CSC 458 - Data Mining & Predictive Analytics I, Spring 2024, Wed 6:00-8:50 PM. Old Main 158.
 

Dr. Dale E. Parson, https://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458 OR See D2L Course CSC458 -> Content -> Overview for the link.
Student instructions for using Zoom.
IF you don’t want to be recorded or are a minor, use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.

Office Hours Monday 2-4, Wednesday 4-6 (Zoom only), Thursday 10-11 or by appt. All available via Zoom.
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Thursday May 9 office hour switched to 1-2 PM, others as above.

First day handout (syllabus that is specific to this semester).


I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks!

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% late penalty for projects that come in after the due date. There will be a 10% deduction from a homework assignment for repeated web surfing, web-based chatting or other use of the Internet for activities unrelated to class activities during both lectures and working sessions. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.
 
RESOURCES & HANDOUTS.

For students new to using our department's Linux servers:

Please log into acad or mcgonagall and run the following commands:

$ python -V
Python 3.7.7
$ ipython -V
7.14.0

If you see earlier version numbers, edit a file called .bash_profile in your login directory and add the following 2 lines at the top:

alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"

Log out, log back in, and check the version numbers again. Let me know if you run into problems.
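
The version check above can also be done from inside Python. The following is a minimal sketch, assuming the 3.7 minimum implied by the Python 3.7.7 output shown above; the names REQUIRED and version_ok are illustrative and not part of any course script.

```python
import sys

# Assumption: 3.7 is the minimum, taken from the "Python 3.7.7" output above.
REQUIRED = (3, 7)

def version_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version is at least the required one."""
    return tuple(version_info[:2]) >= tuple(required)

if __name__ == "__main__":
    # sys.version starts with the dotted version string, e.g. "3.7.7 (default, ...)".
    print("Running Python", sys.version.split()[0])
    if not version_ok():
        wanted = ".".join(str(part) for part in REQUIRED)
        print("This is older than", wanted, "- recheck the aliases in .bash_profile.")
```

Save it to a file and run it with the python command after logging back in; it reports the version that the alias actually resolves to.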

Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.
It is also possible to use the scp, ssh-based file copy command in Mac or Windows command line utilities.

Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine. I have put a copy of the 3rd edition of the textbook on reserve in Rohrbach Library. You can go to the front desk & borrow it overnight.


  If you are new to Python you are encouraged to come to my office hours in person or via Zoom.

I adapted slides for Linear Regression and M5P Trees (10/30/2017) from Kotu's & Deshpande's Predictive Analytics and Data Mining.

Here is our textbook's web page. We will be using the Weka tool set, which you can download to your machine from here. (Download & install Weka 3.8.6)
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    I will draw some material from this textbook as well.

The Weka download page has (had?) this note:
If your computer has a display that has a high pixel density, and you are using Windows, Weka's user interfaces may not be scaled appropriately and appear tiny. Installing Java 9 or later solves this problem. Alternatively, in the Program menu of Weka's GUIChooser, go into Settings, and select WindowsLookAndFeel from the "Look and feel for UI" dropdown menu. Some Weka packages currently do not work (properly) with Java 9 or later (tigerJython and scatterPlot3D).

PYTHON.
    How to Think Like a Computer Scientist looks like a good tutorial for Python 3.x newbies.
    We will be using the 3.x version of Python.
    Try running python -V to see that you are getting Python 3.x.x as your default. We may be updating the version early in the semester.
        From the mcgonagall machine (ssh mcgonagall from acad) do the following actions in bold:
        Edit a file called .bash_profile in your login directory (create it if needed) and add these 3 lines near the top.
                export PATH="/usr/local/bin:${PATH}"
                alias python="/usr/local/bin/python3.7"
                alias ipython="/usr/local/bin/ipython3"
        Save the file and exit the editor, log out and log back into mcgonagall.
        Now type this:
                python -V    # You should see this:
                    Python 3.7.7
            If you install python on your own machine, just running python will get you the simpler-to-use interpreter.
            I will use ipython in lecture.

    The Python website is at http://www.python.org/.
    There is a good on-line tutorial and reference by Steven F. Lott called Building Skills in Python. There is a PDF copy here.
    I taught CSC223 using Python in fall 2023. There are many tutorial resources.
    The IPython site is here.
We will be using Python for data preparation in assignment 1.
    We have Python installed on acad, but if you want your own copy:
    You can download Python 3.x from here. Use the most recent stable 3.x for this course.
    Documentation including tutorials for the 3.x library is here.

Here are my introductory slides on Python. We will explore Python in class.

RESOURCES

The pythex utility for testing Python regular expressions
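
A pattern tried out in pythex can be pasted into Python's re module unchanged. The sketch below parses the heading line of this page as sample text; the pattern and its group names are illustrative assumptions, not a format used by any course script.

```python
import re

# Sample text: the heading line of this page.
LINE = ("CSC 458 - Data Mining & Predictive Analytics I, Spring 2024, "
        "Wed 6:00-8:50 PM. Old Main 158.")

# Illustrative pattern; named groups make the extracted fields self-describing.
PATTERN = re.compile(
    r"(?P<dept>[A-Z]{3})\s+(?P<number>\d{3})\s+-\s+"      # department and course number
    r"(?P<title>.+?),\s+"                                  # course title up to the comma
    r"(?P<term>Spring|Summer|Fall|Winter)\s+(?P<year>\d{4}),\s+"
    r"(?P<day>\w{3})\s+(?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})\s+"
    r"(?P<ampm>[AP]M)\.\s+"
    r"(?P<room>.+?)\.$"                                    # room up to the final period
)

match = PATTERN.match(LINE)
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(name, "=", value)
```

Checking for a failed match before calling groupdict() matters when the same pattern is run over many lines, since real data rarely matches one pattern everywhere.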

D. Parson, 2022, Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Wissam Malke's thesis "Machine Listening with Very Small Training Datasets".

Mapping Data Visualization to Timbral Sonification and Machine Listening, Dale E. Parson, Wissam Malke, Halley Langley, and Danielle Emily Hoch, white paper, 2017.

D. Parson, "Simulated Contact Tracing  of COVID-19 Propagation at Kutztown University for Fall 2020"
        and the slides. Here is one video and here is another.

Textbook slides
        Chapter 1 (week 1 - overview)
        Chapter 2 (week 1 - input)
        Chapter 3 (week 3 - output)
        Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)

           Compilation of Weka slides on Instance Based Learning and Clustering
            A graph on informational entropy, relates to building rules & decision trees.
            A page describing Bayes theorem and related matters.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
            BayesNet examples from the textbook.
        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
        Chapter 8 (week 6 - data transformations)
        Chapter 12 on Ensemble Learning
        Time-series Data Analysis
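
The Bayes theorem page and the 52-card deck example listed under Chapter 4 above can be illustrated with a short counting exercise. This is a minimal sketch written for this page; it is not the BayesCards.py script on acad, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def always(card):
    """Condition that every card satisfies (no evidence)."""
    return True

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in ("J", "Q", "K")

def prob(event, given=always):
    """P(event | given), found by counting cards in a uniformly shuffled deck."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))

# Bayes' theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
bayes = prob(is_face, given=is_king) * prob(is_king) / prob(is_face)

# Direct count of kings among the face cards, for comparison.
direct = prob(is_king, given=is_face)

print("Bayes:", bayes, "Direct:", direct)
```

Using Fraction keeps the probabilities exact, so the Bayes result and the direct count can be compared without rounding error.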

I will draw some material from this textbook as well.
        Chapter 1 (overview)
        Chapter 2 (overview)
        Chapter 3A (data exploration)
        Chapter 3B (data exploration)
        Chapter 4A (information-based learning)
        Chapter 6A (probability-based learning)
        Chapter 6B (probability-based learning)
        Chapter 8A (evaluation)
        Chapter 8B (evaluation)
        Appendix A (descriptive statistics & data visualization)
        Appendix B (introduction to probability)  

ASSIGNMENTS

ASSIGNMENT 1 on Classification due by 11:59 PM on Thursday February 15 via D2L.
    Turn in files CSC458S24ClassifyAssn1Turnin.arff and README.txt with your answers.
    Here is documentation on the Kappa metric.
    Here is a slide on information entropy used in decision tree building.
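
The two measures referenced for Assignment 1 can be computed by hand. The sketch below assumes the usual definitions: Cohen's kappa from a confusion matrix, and Shannon entropy in bits from class counts. The function names and the sample matrix are illustrative and are not taken from the assignment.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted).

    Undefined (division by zero) when the chance agreement equals 1.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Made-up 2-class confusion matrix, for illustration only.
confusion = [[40, 10],
             [5, 45]]

print("kappa =", round(kappa(confusion), 3))
print("entropy of a 9/5 class split =", round(entropy([9, 5]), 3))
```

Kappa discounts the agreement a classifier would reach by chance, which is why a majority-class guesser can show high accuracy but a kappa near zero.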

ASSIGNMENT 2 on Regression due by 11:59 PM on Thursday February 29 via D2L.

    Start at slide 60 Evaluating Numeric Prediction for correlation coefficient and error measures MAE and RMSE.
    See ~parson/DataMine/pearson.py on acad.
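
The correlation coefficient and the MAE and RMSE error measures can be sketched in a few lines. This is not the pearson.py script on acad; it assumes the standard textbook formulas, and the sample values are made up.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up target values and predictions, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("r    =", round(pearson(actual, predicted), 4))
print("MAE  =", mae(actual, predicted))
print("RMSE =", round(rmse(actual, predicted), 4))
```

Correlation measures how well predictions track the targets regardless of scale, while MAE and RMSE are in the units of the target attribute.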

ASSIGNMENT 3 on data compression & discrete classification is due 11:59 PM March 21 via D2L.

ASSIGNMENT 4 Data Cleaning Project in Python is due by 11:59 PM on Friday April 12 via D2L Assignment 4.

ASSIGNMENT 5 due 11:59 PM Thursday May 2 via D2L Assignment 5.
    ADDED a Python script to analyze monotonic cluster sequences for Q11 in this zip file.
        We will go over it on May 8 in class.


If you are new to working on our Linux systems, you can bring up a Windows CMD prompt and from there:
        ssh YOURLOGINID@acad.kutztown.edu
   

    It is possible to use the scp, ssh-based file copy command in Mac or Windows command line utilities.

    If you do not have a favorite Linux text editor, try running nano on acad. Python assignments must run on mcgonagall.

 

PYTHON SELF STUDY FROM LAST TERM:

For anyone not experienced in programming Python, here are some tutorials to work through over spring break
from my CSC223 course last semester:
Weekly class time materials
Week 1: Python Resources and Python Basics
    Read and work along with Sections 1 through 5 of the Python Tutorial in parallel to our class time examination of Python basics.
Week 2 is about functions and function-like constructs in Python.
Week 3 is the sorting example ...
August 29 Class First-day handout, overview of the course, logging into acad & mcgonagall Linux servers for your projects.
August 31 Class Setting up python and ipython aliases on acad. Interactive walk through of primitive data types & some aggregate types
    (int, float, str, None, list, tuple, set, & frozenset types. The mutable object types are in bold.) Next time dicts (a.k.a. maps).
September 5 Class Completed data types with dictionaries, went over if-for-while control constructs, functions, and Python's use of indentation.
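
The data types listed for August 31 split into mutable and immutable ones. The sketch below sorts them using hashability as a rough proxy for immutability; that proxy is an assumption that holds for these built-in values but not for every Python object.

```python
# One sample value for each type named in the August 31 entry.
samples = [42, 3.14, "weka", None, [1, 2], (1, 2), {1, 2}, frozenset({1, 2})]

def is_hashable(obj):
    """Return True if obj can be a dict key or set member."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False

mutable_names = [type(obj).__name__ for obj in samples if not is_hashable(obj)]
immutable_names = [type(obj).__name__ for obj in samples if is_hashable(obj)]

print("mutable:  ", mutable_names)
print("immutable:", immutable_names)

# Mutation in place works for a list but not for a tuple.
numbers = [1, 2]
numbers.append(3)
try:
    (1, 2).append(3)
except AttributeError:
    print("tuples have no append method")
```

Dictionaries, covered on September 5, are mutable as well, which is why their keys must be hashable values such as str, int, or tuple.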

ZOOM ARCHIVES


Jan 24 Class introduced the course, certificates & minor in data science @ KU, Weka & concepts.
Jan 31 Class went over mechanics and concepts for Assignment 1.
Feb 7 Class on Assignment 1's extractAudioFreqARFF17Oct2023.py for extracting assn1's input
    dataset from 10,005 .wav audio files. Here is a Feb 8 page on the ZeroR confusion matrices
    that confused us at the end of class. The last 55 minutes were work time with some recorded Q&A.
Feb 14 Class evaluating numeric prediction (slides 60-69), correlation coefficient, Assignment 2 handout.
Feb 21 Class solution to Assignment 1, compressing polynomial & exponential data relationships
    into linear relationships for use with linear machine learning algorithms, Assignment 2 Q&A.
    Solution to Assignment 1 removed for summer 2024.
Feb 28 Class some Q&A from office hours on Assignment 2, finish slides on prepping & teaching
    data science courses, hand out Assignment 3. There will be some work time next week.
Mar 6 Class Naive Bayes including playing card example, work time on Assignment 3.
March 20 Class went over using pythex utility to parse data copied from new MyKU (Banner) course listings.
    The Python regular expression library will be used in either assignment 4 or 5.
March 27 Class went over solution to Assignment 3 and then prep for Assignment 4 to be posted Friday 3/29.
April 3 Class half the slides on Instance-Based ("Lazy") Learning, examples in Weka & a visualization. Work time.
April 10 Class Finished slides on IBk nearest-neighbor instance-based models, also clustering, went over Assignment 5.
April 17 Class Went over slides, paper, ran simulated contact tracing for COVID@KU2020.
        50 minutes assn5 work time.
April 24 Class Parallel Coordinates Visualization, Data Sonification & Weka Listening Research from 2015-2017.
    See Parson, Hoch, Langley, and Malke references above.
May 1 Class went over summer 2022 and summer 2023 cleaning & analyses of Hawk Mountain data.
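
The Feb 21 entry above mentions compressing exponential relationships into linear ones so that linear learners can fit them. One common way to do this, assumed here and not taken from the class materials, is to take the logarithm of the target: if y = a * exp(b * x), then ln(y) = ln(a) + b * x, which is a straight line in x. The data and names below are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Synthetic exponential data: y = 2.0 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Fit a line to (x, ln y), then map the coefficients back to a and b.
slope, intercept = fit_line(xs, [math.log(y) for y in ys])
a, b = math.exp(intercept), slope

print("a =", round(a, 6), "b =", round(b, 6))
```

The same idea applies to a power relationship y = a * x^b by taking the logarithm of both x and y before fitting.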