CSC 558 - Data Mining and Predictive Analytics II, Spring 2020, Tu 6-8:50 PM in Old Main 158.

Dr. Dale E. Parson

Students can come to class or attend live on-line via Zoom.
         Please read student instructions here.
         Here is the Zoom link for attending remotely at class time.
         Zoom video archives go at the bottom of this page.
Spring 2020 Office Hours: Tu 2:30-4:30, Wed 12:00-2:00, Fri 1:30-2:30, or by appointment.
Parson office hours for last two weeks of spring 2020:
Tuesday Apr 28 2-3 PM (changed from 2:30-4:30)
Wed Apr 29 12-2 PM as usual
Fri May 1 1:30-3:30 (extra hour added)
Final exam week hours:
Tuesday May 5 2:30-4:30 PM as usual
Wed May 6 12-1 PM (one hour less)
Thursday May 7 1:30-3:30 (Friday's usual office hour cancelled)

Real-time on-line teaching via Zoom runs from March 23 through the end of the semester.
     You can attend interactively at normal class time if possible via our normal link https://kutztown.zoom.us/j/914190228.
     Use the Chrome browser if possible. Click the link before March 23 to auto-install Zoom if you haven't already.
     I will post a link to a video recording of each class within 24 hours at the bottom of this course page as before.
     My office hours will take place at the normal times VIA THIS DIFFERENT, OFFICE HOUR ZOOM ROOM.
     Please watch your email & this course page. If Zoom runs out of steam, I will post YouTube videos & send email.

Deloitte is recruiting Data Scientists (including graduate level) and Data Analysts (undergrad only) in Mechanicsburg, PA

First day handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine.
There are on-line copies of the Third Edition available in Rohrbach Library.
There are 4 courses or 3 courses + (research or internship) in our Graduate Data Analytics certificate program.
    You need to register for the program, which is free, and you can use courses from a CSIT master's program.
    Talk to me if you are interested.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks! Here is a poll to which you can reply privately on paper or via email.

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.

 
RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Link to the Fall 2019 CSC458 prerequisite course.
Link to the Spring 2018 CSC558 offering.

Slides leading to Assignment 1 on Ensemble Learning.
    Fall 2017 CSC458 slides on evaluating numeric prediction.
    A summary of the Kappa Statistic.
    A subset of Chapter 5 on Evaluation and 7 on Data Transformations.
    Chapter 12 on Ensemble Learning.
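As a companion to the Kappa Statistic summary listed above, here is a minimal Python sketch of Cohen's kappa computed from a confusion matrix. It uses the standard textbook definition, (observed agreement - chance agreement) / (1 - chance agreement), and is not taken from the linked summary. The function name cohen_kappa and the rows-are-actual, columns-are-predicted layout are assumptions made for illustration.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Assumed layout (hypothetical, for illustration): confusion[i][j] is the
    number of instances of actual class i that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix has no instances")

    # Observed agreement: fraction of instances on the main diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total

    # Chance agreement: for each class, (actual share) * (predicted share).
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    if expected == 1.0:
        # Kappa is undefined when chance agreement is already perfect;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up counts for a two-class problem, only to exercise the function.
    matrix = [[20, 5],
              [10, 15]]
    print(round(cohen_kappa(matrix), 4))
```

A kappa near 0 means the classifier agrees with the true labels about as often as chance would predict from the class totals, and a kappa of 1 means perfect agreement.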

Here are textbook slides from Kotu’s & Deshpande’s Predictive Analytics and Data Mining: Concepts and Practice using RapidMiner.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using version 3.8.4 of the Weka tool set, which you can download to your machine from here.
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    Additional Weka documentation is here.
    I will draw some material from this textbook as well.

        

We may use the Kaggle site for a project some time this semester.

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.


ASSIGNMENTS


Readings related to several assignments:
    Machine Listening with Very Small Training Datasets, Wissam Malke’s January 2017 master’s thesis.
    Mapping Data Visualization to Timbral Sonification and Machine Listening (a spring 2017 unpublished paper by Parson et al.)
    Instance-Based Learning Algorithms, a paper from 1991.
    K*: An Instance-based Learner Using an Entropic Distance Measure, a paper from 1995.
    Locally Weighted Naive Bayes, a paper from 2012.
    A graph on informational entropy, which relates to building rules, decision trees, and K*.
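To make the entropy reading above concrete, here is a minimal Python sketch of Shannon entropy over class labels and of the information gain of a split, the quantity that ID3-style decision-tree learners typically maximize when choosing an attribute. This is the standard definition, not a reproduction of the linked graph, and the function names entropy and information_gain are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(partitions):
    """Entropy reduction from splitting the pooled labels into partitions.

    partitions is a list of label lists, one per branch of a candidate split.
    """
    pooled = [label for part in partitions for label in part]
    total = len(pooled)
    if total == 0:
        return 0.0
    # Weighted average entropy of the branches after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(pooled) - remainder


if __name__ == "__main__":
    # Made-up labels: a split that separates the classes perfectly
    # versus one that leaves each branch as mixed as the parent.
    print(information_gain([["yes", "yes"], ["no", "no"]]))
    print(information_gain([["yes", "no"], ["yes", "no"]]))
```

A split whose branches are purer than the parent has positive gain; a split that leaves the class mix unchanged has zero gain.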

Audio signal overview from spring 2020 for Assignment 1.
Assignment 1 is due via make turnitin by 11:59 PM Wednesday February 19. 10% per day late penalty applies.
    Here is the slide on information entropy, which relates to how KStar and many decision trees make decisions.
        The link to this slide was broken on 2/4. We will go over it on 2/11.
    My Assignment 1 answers are posted here.

Assignment 2 is due via make turnitin by 11:59 PM Wednesday March 4. 10% per day late penalty applies.
    Run the following from within your project directory before you make turnitin:
        cp  ~parson/DataMine/whitenoise558sp2020/checkfiles.sh  checkfiles.sh
    It won't affect anything other than turning it in.
    My Assignment 2 answers are posted here.
    Added 3/29: A bash script to solve this assignment and its output to the terminal.
        ~parson/DataMine/whitenoise558sp2020shell.zip on acad
        Another bash script that projects the class attribute from other attributes for classification by scikit-learn.

Assignment 3 on analyzing time-series data is due by 11:59 PM April 15 via make turnitin.
    Time series PPT slides.
    My solution to Assignment 3 and a concept related to bonus points awarded for Q12.

Chapter 4 Weka slides starting at slide 76 on logistic regression, emphasis on hyperplanes and perceptrons (neural nets).
    "Algorithms: the basic methods" -- Instance based learning, nearest neighbor KD and Ball trees are in these slides.
My slides abstracted from Weka Section 7 on "Extending Linear Models". Support vector machines & neural nets.
    Slides for Chapter 6, Rules and Trees.
    Slides for Chapter 7, Extending instance-based and linear models.

Assignments 4 & 5 are individual student mini-research projects.
    Assignment 4 is due April 5, and Assignment 5 is due April 26, with talks to follow, as described in the linked handout.
    Liquid Interactive will make an award to the best overall project as described in the linked handout.

     ZOOM RECORDING ARCHIVES
           Jan 21, 2020 Last hour of class was start of an overview of using Weka.
           Jan 28, 2020 Went through the fall CSC458 final comprehensive assignment, demoing how to use Weka, and also slides on error measures.
           Feb 4, 2020 Went over Assignment 1 and the audio signal data domain on which it is based.
           Feb 11, 2020 Went over slides for Ensemble Learning & also a case study of recruitment & retention of scholarship students.
           Feb 18, 2020 Went over Assignment 2 handout, Assignment 4 & 5 topics & Liquid award, answered some Assignment 1 questions.
           Feb 25, 2020 Went over history of student programmer study, data sonification, & machine listener, in context of instance-based models.
           Mar 3, 2020 Went over time series slides & examples, then work session Q&A.
           Mar 24, 2020 Went over my solution to assn2, handout for assn3, and some Q&A about assn4/5.
           Mar 31, 2020 Went over bash shell scripting for batch, command-line invocation of Weka, some assignment Q&A.
                April 1 office hours did some trial & error work on removing non-ASCII chars from .csv files.
           Apr 8, 2020 Office hour Q&A on ifelse(...) in MathExpression for Assignment 3. Did not record work session of 4/7.
           Apr 14, 2020 Assorted Q&A about assignments 3 & 5 during a work session.
           Apr 21, 2020 Went over my solution to assn3 time-series and answered numerous assn5 questions.
           Apr 28, 2020 The first half of final project presentations.
                I forgot to start recording until 5 minutes into Faith's presentation. Apologies to Lori and Faith.
           Apr 30, 2020 11 AM Faculty presentation on "Assessing a Scholarship Program for Underrepresented Students in CS&IT"
           May 5, 2020 The second half of the final project presentations.