CSC 458 - Data Mining & Predictive Analytics I, Fall 2022, TuTh 4:30-5:50 PM.
 

Data Mining Effects of 50 Years of Climate Change at Hawk Mountain Sanctuary
    Thursday November 17 11-11:45 AM in Old Main 158.
    PowerPoint slides here. PDF slides are here. Here is the presentation Zoom video.
    Summer's Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021.

Dr. Dale E. Parson, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC458 OR See D2L Course CSC458 -> Content -> Overview for the link.
Student instructions for using Zoom.
IF you don’t want to be recorded or are a minor, use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.
Office Hours Monday 2-4, Wednesday 4-6 (Zoom only), Thursday 10-11 or by appt. All available via Zoom.
parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872


First day handout (syllabus that is specific to this semester).


I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks!

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date. There will be a 10% deduction from a homework assignment for repeated web surfing, web-based chatting or other use of the Internet for activities unrelated to class activities during both lectures and working sessions. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.
 
RESOURCES & HANDOUTS.

For students new to using our department's Linux servers:

Please log into acad or mcgonagall and run the following commands:

    $ python -V
    Python 3.7.7
    $ ipython -V
    7.14.0

If you see earlier version numbers, edit a file called .bash_profile in your login directory and add the following 2 lines at the top:

    alias python="/usr/local/bin/python3.7"
    alias ipython="/usr/local/bin/ipython3"

Log out, log back in, and check the version numbers again. Let me know if you run into problems.

Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.

Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine. I have put a copy of the 3rd edition of the textbook on reserve in Rohrbach Library. You can go to the front desk & borrow it overnight.

The Graduate Assistant Tutor schedule is here.
    If you are new to Python and have basic questions, I have told them it is OK for you to ask questions.
    Of course you are encouraged to come to my office hours in person or via Zoom.

I adapted slides for Linear Regression and M5P Trees from Kotu’s & Deshpande’s Predictive Analytics and Data Mining, 10/30/2017.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using the Weka tool set, which you can download to your machine from here. (Download & install Weka 3.8.6)
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    I will draw some material from this textbook as well.

The Weka download page has this note:
If your computer has a display that has a high pixel density, and you are using Windows, Weka's user interfaces may not be scaled appropriately and appear tiny. Installing Java 9 or later solves this problem. Alternatively, in the Program menu of Weka's GUIChooser, go into Settings, and select WindowsLookAndFeel from the "Look and feel for UI" dropdown menu. Some Weka packages currently do not work (properly) with Java 9 or later (tigerJython and scatterPlot3D).

PYTHON.
    How to Think Like a Computer Scientist looks like a good tutorial for Python 3.x newbies.
    We will be using the 3.x version of Python.
    Try running python -V to see that you are getting Python 3.x.x as your default.
        From the mcgonagall machine (ssh mcgonagall from acad) do the following:
        Edit a file called .bash_profile in your login directory (create it if needed) and add these 3 lines near the top.
                export PATH="/usr/local/bin:${PATH}"
                alias python="/usr/local/bin/python3.7"
                alias ipython="/usr/local/bin/ipython3"
        Save the file and exit the editor, log out and log back into mcgonagall.
        Now type this:
                python -V    # You should see this:
                    Python 3.7.7
            If you install python on your own machine, just running python will get you the simpler-to-use interpreter.
            I will use ipython in lecture.
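
The same version check can also be made from inside the interpreter. The following is only a minimal sketch and is not part of the course materials: the helper name is_python3 and its default minimum of (3, 7) are assumptions chosen to match the Python 3.7.7 shown above.

```python
import sys


def is_python3(minimum=(3, 7)):
    """Return True when the running interpreter is at least `minimum`.

    The name `is_python3` and the default (3, 7) are illustrative
    assumptions, not something defined by the course.
    """
    return sys.version_info[:2] >= tuple(minimum)


# Report the interpreter actually in use, then compare it to the minimum.
print("Running Python", sys.version.split()[0])
if not is_python3():
    print("Older interpreter found; check the alias lines in .bash_profile.")
```

Running this with python and then with ipython shows whether both aliases point at a 3.x interpreter.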

    The Python website is at http://www.python.org/.
    There is a good on-line tutorial and reference by Steven F. Lott called Building Skills in Python. There is a PDF copy here.
    The IPython site is here.
We will be using Python for data preparation in assignment 1.
    We have Python installed on acad, but if you want your own copy:
    You can download Python 3.x from here. Use the most recent stable 3.x for this course.
    Documentation including tutorials for the 3.x library is here.

Here are my introductory slides on Python. We will explore Python in class.

RESOURCES

The pythex utility for testing Python regular expressions

D. Parson, 2022, Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Textbook slides
        Chapter 1 (week 1 - overview)
        Chapter 2 (week 1 - input)
        Chapter 3 (week 3 - output)
        Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)

            Compilation of Weka slides on Instance Based Learning and Clustering
            A graph on informational entropy, relates to building rules & decision trees.
            A page describing Bayes theorem and related matters.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
            BayesNet examples from the textbook.
        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
        Chapter 8 (week 6 - data transformations)
        Chapter 12 on Ensemble Learning
            We used data from finalexam458fall2018.problem.zip at ~parson/DataMine to demo ensemble learning on 4/13/2021.
        Time-series Data Analysis
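
The informational entropy that the Chapter 4 material relates to building rules & decision trees can be computed in a few lines. This is a minimal sketch under assumed names: entropy and info_gain are hypothetical helpers, not code from the textbook or from Weka, and the class labels are toy data invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum p * log2(1/p) over each distinct label, where p = n / total.
    return sum((n / total) * log2(total / n) for n in counts.values())


def info_gain(parent, partitions):
    """Entropy of `parent` minus the size-weighted entropy of its partitions."""
    total = len(parent)
    remainder = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - remainder


# Toy data (invented): an even split between two classes.
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))
# Splitting the labels into two pure subsets removes all uncertainty.
print(info_gain(labels, [["yes", "yes"], ["no", "no"]]))
```

A decision-tree learner that uses information gain would compare values like the second one across candidate attributes and split on the attribute with the largest gain.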

I will draw some material from this textbook as well.
        Chapter 1 (overview)
        Chapter 2 (overview)
        Chapter 3A (data exploration)
        Chapter 3B (data exploration)
        Chapter 4A (information-based learning)
        Chapter 6A (probability-based learning)
        Chapter 6B (probability-based learning)
        Chapter 8A (evaluation)
        Chapter 8B (evaluation)
        Appendix A (descriptive statistics & data visualization)
        Appendix B (introduction to probability)  
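
Bayes' theorem, which underlies the probability-based learning chapters above, can be illustrated with a standard 52-card deck. This sketch is not the BayesCards.py script on acad; the names prob, bayes, is_king and is_face are hypothetical, and the question asked (the probability of a king given that the card is a face card) is an example chosen for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event, cards=DECK):
    """Fraction of `cards` for which the predicate `event` is true."""
    return Fraction(sum(1 for card in cards if event(card)), len(cards))


def bayes(likelihood, prior, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in ("J", "Q", "K")


kings = [card for card in DECK if is_king(card)]
p_king = prob(is_king)                    # prior P(king)
p_face = prob(is_face)                    # evidence P(face)
p_face_given_king = prob(is_face, kings)  # likelihood P(face | king)

posterior = bayes(p_face_given_king, p_king, p_face)
print(posterior)  # P(king | face) as an exact fraction
```

Counting kings directly among the 12 face cards gives the same fraction, which is a quick check that the theorem was applied correctly.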

ASSIGNMENTS

Assignment 1, due 11:59 PM September 22 via make turnitin.
    Red line added to assignment spec on 9/22.
    I added some per-STUDENT-requirement ipython tutorial examples at that page's bottom on 9/4.
    Added 9/12:
The comments for my findInName2Col(...) and your findOutName2Col() incorrectly state:
map (Python dict) that maps the column number as an int to the name of the attribute stripped of leading & trailing blanks.
IT SHOULD SAY:
map (Python dict) that maps the name of the attribute stripped of leading & trailing blanks to the column number as an int.
My findInName2Col(...) is implemented correctly. Only the comments are backwards.
Also, the error messages disappear in a useful order if you complete STUDENT requirements 5 & 6 before 3 & 4, because the function normalizePerMinMaxMutateInPlace(normdatarows, noNormalizeColSet), which contains 3 & 4, runs after 5 & 6.
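
The corrected comment (attribute name to column number) can be illustrated alongside a min/max normalization step. The sketch below is not the assignment's starter code or solution: name_to_col and normalize_min_max_in_place are hypothetical stand-ins, the header and rows are invented, and the per-column formula (x - min) / (max - min) is an assumption about what min/max normalization means here.

```python
def name_to_col(header):
    """Map each attribute name, stripped of leading & trailing blanks,
    to its column number as an int (name -> column, not column -> name)."""
    return {name.strip(): col for col, name in enumerate(header)}


def normalize_min_max_in_place(rows, no_normalize_cols):
    """Rescale every column not in `no_normalize_cols` to the range [0, 1],
    mutating `rows` in place. Assumes the rescaled cells are numeric."""
    if not rows:
        return
    for col in range(len(rows[0])):
        if col in no_normalize_cols:
            continue
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in rows:
            # A constant column has no spread, so map it to 0.0.
            row[col] = (row[col] - low) / span if span else 0.0


# Invented example data: note the blanks around some header names.
header = [" year ", "count", " site"]
rows = [[1976, 10.0, "A"], [1977, 30.0, "B"], [1978, 20.0, "A"]]

cols = name_to_col(header)
normalize_min_max_in_place(rows, {cols["year"], cols["site"]})
print(cols)
print(rows)
```

Looking columns up by name, as in cols["year"], is what makes the name-to-column direction of the map the useful one.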
Assignment 2 on numeric regression due 11:59 PM October 14 via D2L.

Assignment 3 on data compression & discrete classification is due 11:59 PM November 3 via D2L.
    Parson's discussion of the Kappa statistic. Parson's solution to spring 2021 CSC458 classification problem as a tutorial.
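
    For reference, the Kappa statistic can be computed from a confusion matrix as (observed agreement - chance agreement) / (1 - chance agreement). The sketch below uses the common definition of Cohen's kappa and is not taken from the linked discussion; the function name kappa and the 2x2 matrix are assumptions made up for illustration.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows = actual class,
    columns = predicted class). Assumes chance agreement is below 1."""
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Chance agreement: expected diagonal mass if actual and predicted
    # classes were statistically independent.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Invented example: 35 of 50 instances classified correctly.
matrix = [[20, 5],
          [10, 15]]
print(round(kappa(matrix), 3))
```

    A kappa near 0 means the classifier does about as well as chance agreement, and a kappa of 1 means every instance falls on the diagonal.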

Assignment 4 on classification of nominal values and time-series analysis is due by 11:59 PM on Friday November 25 via D2L.
    My related research paper from 2006. Here is one related book and then another one.
  
Assignment 5 is a redo of one of Assignments 2, 3, or 4, using new regressors and/or classifiers with new configuration parameters.
    It is due via D2L by end of Thursday December 15. Our "final exam" class on 12/15 at 2 PM will be a work session.
 

ZOOM ARCHIVES
      
September 1 class was introductory overview of Python data & control structures.
September 6 class Data Analytics Certificate program, start Assignment 1 spec & code.
Reagan of CSC458 graciously agreed we could record our office-hour tour of PuTTY & Notepad++ on 9/7.
September 8 class we completed going over Assn1. Please start this weekend as Tuesday will be a work session.
September 13 class was an assn1 work session, this video is 30 minutes of Q&A starting with applying HINTS at the bottom of the handout.
September 15 class certificate program, textbook slides chapter 1, and some related examples from Hawk Mountain dataset.
September 20 class went over remaining slides from Chapter 1 then Chapter 2 and looked at related Hawk Mountain models.
September 22 class went over Chapter 3 slides and a first intro to using Weka.
September 27 class went over handouts & example Weka usage for Assignment 2. Thursday 9/29 will be a work session.
September 29 class brief Q&A during a work session.
October 4 class started going over prep materials beneath Assignment 3 above.
October 6 class went over remaining prep for Assignment 3, will send it out within a week.
October 18 class went over my solution to Assignment 2, random number generator seeds, and related.

October 20 class went over Assignment 3 handout & related topics & Weka demo.
October 25 class went over Bayes / conditional probabilities as used in model building.

October 27 class went over the Naive Bayes assumption of statistical independence of attributes and started ensemble model learning.
November 1 class went over time-series data analysis and draft plans for Assignment 4.
November 3 was a working session on Assignment 3.
November 8 class went over my Assignment 3 solution, Assignment 4 up to README Q1.
November 10 class more Assignment 4 including README then project work session.
November 15 class went over Python cleaning & time-lagging script used to prep Assignment 4.
November 17 class was my prerecorded video on Hawk Mtn. data analysis from earlier in the day,
    plus some unrecorded Q&A after the video.
November 29 class (unlinked) Assignment 4 example solution, handout Assignment 5, overview of Clustering.
December 6 class 3D and Parallel Attribute Graph Visualizations of Assignment 4 Data
    Parallel Attributes book by Alfred Inselberg. Some additional 5D attributes visualization.