CSC 458 - Data Mining and Predictive Analytics I, Fall 2019

Sept 21, 2019 Field Trip to Hawk Mountain, be at Visitor Center by 9 AM. (Photo added 10/6/2019.)

The Weka download page has this note:

If your computer has a display that has a high pixel density, and you are using Windows, Weka's user interfaces may not be scaled appropriately and appear tiny. Installing Java 9 or later solves this problem. Alternatively, in the Program menu of Weka's GUIChooser, go into Settings, and select WindowsLookAndFeel from the "Look and feel for UI" dropdown menu. Some Weka packages currently do not work (properly) with Java 9 or later (tigerJython and scatterPlot3D).

CSC480 Special topics course in spring 2020:
This course increases breadth and depth of knowledge for students with experience in object-oriented programming for multimedia systems. Advanced topics include working with camera point-of-view and lighting sources for 3D graphics, recursive shapes and fractals, pixel-level image processing, and animated video composition. Students will program graphical images, video streams, audio signals, physical devices containing electronic sensors and effectors, and combinations of these media. There will be solo and team programming projects.
Prerequisites: CSC220 with a grade of C or better. (Presumably that prereq should have included "or unconditional admission to the Graduate program.")

Dr. Dale E. Parson

We meet Wednesday 6-8:50 PM in Old Main 158.
All sections of students can meet at that time, live on-line via Zoom.
    Please read student instructions here. My instructions for CSC faculty are here.
    USE THIS LINK TO LOG INTO ZOOM AT CLASS TIME.
    I will post links to recorded archives at the bottom of this page within a day after each class.
    Last year's course page is here.
Fall 2019 Office Hours (Old Main 260): Mon 12:30-2:30, Tu 3-4, Wed 2:30-4:30, or by appointment

First day handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine. I have put a copy of the 3rd edition of the textbook on reserve in Rohrbach Library. You can go to the front desk & borrow it overnight.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks! Here is a poll to which you can reply privately on paper or via email.

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date. There will be a 10% deduction from a homework assignment for repeated web surfing, web-based chatting or other use of the Internet for activities unrelated to class activities during both lectures and working sessions. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.

 
RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Here are textbook slides from Kotu’s & Deshpande’s Predictive Analytics and Data Mining: Concepts and Practice using RapidMiner.
    I found this book at the start of the semester and will probably use it the next time I teach the course.
    It is at a more appropriate level for the course, but the slides from all 3 textbooks cited on this page stink.
    However, I will use the slides from this book, with my own additions, since they are generally better.
    I adapted slides for Linear Regression and M5P Trees, 10/30/2017.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using the Weka tool set, which you can download to your machine from here.
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    Additional Weka documentation is here.
    I will draw some material from this textbook as well.

PYTHON.
    How to Think Like a Computer Scientist looks like a good tutorial for Python newbies.
    Dr. Schwesinger has posted some additional Python textbooks for CSC223.
    We will be using the 2.x version of Python, although assignment 1 is compatible with both 3.x and 2.x.
    Try running python -V to see that you are getting Python 2.6.x or 2.7.x as your default.
        From the acad machine do the following:
        Edit a file called .bash_profile in your login directory (create it if needed) and add the following near the top.
                alias python2="/usr/bin/python"
        Save the file and exit the editor.
        Now type this:
                python2      # type this to get the basic Python interpreter.
            You should see the interactive python interpreter that we will go over in class.
            If you install python on your own machine, just running python will get you the simpler-to-use interpreter.

    The Python website is at http://www.python.org/.
    There is a good on-line tutorial and reference by Steven F. Lott called Building Skills in Python. There is a PDF copy here.
    The IPython site is here.
We will be using Python for data preparation in assignment 1.
    We have Python installed on acad, but if you want your own copy:
    You can download Python 2.x or 3.x from here. Use the most recent stable 2.x for this course.
    Documentation including tutorials for the 2.x library is here, for 3.x is here.

Here are my introductory slides on Python. We will explore Python in class, so please attend in person or via RTVC.
    Using Notepad++: Go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    Check Replace by space
   
    To convert existing tabs to spaces, press Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta


RESOURCES

The pythex utility for testing Python regular expressions

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Textbook slides
        Chapter 1 (week 1 - overview)
        Chapter 2 (week 1 - input)
        Chapter 3 (week 3 - output)
        Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)
            A graph on informational entropy, relates to building rules & decision trees.
            A page describing Bayes theorem and related matters.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
            BayesNet examples from the textbook.
        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
        Chapter 8 (week 6 - data transformations)
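
As a rough illustration of how informational entropy relates to building rules and decision trees, here is a minimal Python sketch that computes the entropy of a class distribution and the information gain of a candidate split. The function names and the class counts are assumptions made for illustration (9 "yes" and 5 "no" instances split three ways); this is not code from the textbook slides or from Weka.

```python
from math import log

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = float(sum(counts))
    result = 0.0
    for count in counts:
        if count > 0:               # 0 * log(0) is taken as 0
            p = count / total
            result -= p * log(p, 2)
    return result

def info_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = float(sum(parent_counts))
    remainder = 0.0
    for child in child_counts_list:
        remainder += (sum(child) / total) * entropy(child)
    return entropy(parent_counts) - remainder

# Hypothetical counts: 9 "yes" and 5 "no" instances, split three ways.
parent = [9, 5]
children = [[2, 3], [4, 0], [3, 2]]
print("entropy(parent) = %.3f bits" % entropy(parent))
print("information gain = %.3f bits" % info_gain(parent, children))
```

A tree builder that uses this measure would prefer the attribute whose split gives the largest information gain.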

I will draw some material from this textbook as well.
        Chapter 1 (overview)
        Chapter 2 (overview)
        Chapter 3A (data exploration)
        Chapter 3B (data exploration)
        Chapter 4A (information-based learning)
        Chapter 6A (probability-based learning)
        Chapter 6B (probability-based learning)
        Chapter 8A (evaluation)
        Chapter 8B (evaluation)
        Appendix A (descriptive statistics & data visualization)
        Appendix B (introduction to probability)
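
To connect the probability material with the Bayes theorem page and the 52-card-deck example listed above, here is a small sketch of Bayes' theorem applied to a standard deck. It is NOT the BayesCards.py script on acad; the deck construction, the function names, and the choice of events are all assumptions made for illustration.

```python
from fractions import Fraction

RANKS = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
SUITS = ['clubs', 'diamonds', 'hearts', 'spades']
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

def prob(event, given=None):
    """P(event | given) by counting cards; events are predicates on a card."""
    cards = [c for c in DECK if given is None or given(c)]
    matches = [c for c in cards if event(c)]
    return Fraction(len(matches), len(cards))

def is_king(card):
    return card[0] == 'K'

def is_face(card):
    return card[0] in ('J', 'Q', 'K')

# Bayes' theorem: P(king | face) = P(face | king) * P(king) / P(face)
posterior = prob(is_face, is_king) * prob(is_king) / prob(is_face)
print("P(king | face) via Bayes   = %s" % posterior)
print("P(king | face) by counting = %s" % prob(is_king, is_face))
```

Both lines should print the same fraction, since counting the conditional probability directly and deriving it through Bayes' theorem are two routes to the same value.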
  

ASSIGNMENTS


Assignment 1
    Assignment 1, DUE 11:59 PM on Friday September 27 via "make turnitin".
        Here is the README.txt file and the example weatherToARFF2019.py.txt from 2nd half of 9/11 class (not Zoomed).
    Zoom video (Aug 21 on Mac) on capturing weather data via browser->Excel->CSV file, worth 20% of Assignment 1 (25 minutes).
    Comments from 9/11 work session for students working with CSV files having flaky Weather Underground data fields:

I created notes from the 2nd half of the 9/11 class that I extended on the morning of 9/12 on acad in ~parson/DataMine/csc458fall2019assn1/README.txt. If you have a set of Weather Underground CSV files that do not have flaky fields, then following the handout instructions for checking for blank fields or invalid units of measure will work fine. If 'make test' already works for you, AND if you have the checks for blank fields or invalid units of measure, then you have completed the assignment. 'make test' by itself is not enough, since the handout comments require checking for blank fields or invalid units of measure.
HOWEVER, if you get a CSV file with flaky data (one turned up last night), those instructions are not enough. If your script sets ISERROR to True, 'make teststudent' will fail when your script exits. Therefore, do not set ISERROR to True. Instead, write a message to sys.stderr and set the output field to '?'. Again, if 'make test' already works for you, AND if you have the checks for blank fields or invalid units of measure, then you have completed the assignment, and you don't need to bother with data-error recovery. Your wunderground data was good. I highly recommend reading all of ~parson/DataMine/csc458fall2019assn1/README.txt, regardless, to understand the issues.
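
The recovery approach described above can be sketched roughly as follows. This is only an illustration: the function name clean_field, the field names, and the "number followed by a unit" layout are assumptions, not the assignment's actual code or the real Weather Underground format. Consult README.txt for the actual requirements.

```python
import sys

def clean_field(raw, units, fieldname, linenum):
    """Return the numeric part of a field such as '72 F', or '?' on bad data.

    Bad data is reported on sys.stderr instead of setting an error flag,
    so the script can still exit normally.
    """
    text = raw.strip()
    if text == '':
        sys.stderr.write("WARNING line %d: blank %s\n" % (linenum, fieldname))
        return '?'
    parts = text.split()
    if len(parts) != 2 or parts[1] != units:
        sys.stderr.write("WARNING line %d: invalid units in %s: %r\n"
            % (linenum, fieldname, raw))
        return '?'
    try:
        float(parts[0])             # check that the value is numeric
    except ValueError:
        sys.stderr.write("WARNING line %d: non-numeric %s: %r\n"
            % (linenum, fieldname, raw))
        return '?'
    return parts[0]

# Hypothetical fields: one good value, one blank, one with the wrong units.
print(clean_field('72 F', 'F', 'Temperature', 2))
print(clean_field('', 'F', 'Temperature', 3))
print(clean_field('29.9 mph', 'in', 'Pressure', 4))
```

The '?' character is the ARFF marker for a missing value, so Weka can still load the resulting file.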
On 9/18 we will start new material on analyzing data in ARFF files using Weka.
Assignment 2
    Assignment 2 is due by 11:59 PM on October 19 via make turnitin.
    Assignment 2 preview page with sample datasets. A visualization of bird migration in Europe discovered by Bryan McNally.
    Here is my page on interpreting the Kappa statistic needed for Assignment 2. (UPDATED Oct 1, 2019)
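
As a rough companion to the Kappa discussion, the sketch below computes the Kappa statistic from a confusion matrix as (observed agreement - expected chance agreement) / (1 - expected chance agreement). The function name and the confusion-matrix counts are made up for illustration; they are not taken from the Assignment 2 datasets.

```python
def kappa(matrix):
    """Kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = float(sum(sum(row) for row in matrix))
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Expected agreement by chance, from the row and column totals.
    expected = 0.0
    for i in range(n):
        row_total = sum(matrix[i])
        col_total = sum(matrix[r][i] for r in range(n))
        expected += (row_total / total) * (col_total / total)
    return (observed - expected) / (1.0 - expected)

# Hypothetical 2-class confusion matrix with 100 instances.
confusion = [[40, 10],
             [ 5, 45]]
print("kappa = %.3f" % kappa(confusion))
```

A Kappa of 1 means perfect agreement and a Kappa near 0 means the classifier does no better than chance. The sketch does not guard against the degenerate case where expected agreement is exactly 1.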

Assignment 3
    Assignment 3 on using linear models to predict numeric attributes is due by 11:59 PM on Wednesday November 13.
    Here are last year's answers to their assignment 3. We will start going over this October 9.
    If you get this warning in assignments when using a Supplied Test Set (external .arff file) with more attributes than the Training Set,
        just answer "yes" to the warning dialog. If the training set is a strict subset of the test set with respect to attributes,
        where the attribute names and types are identical in both .arff files, this auto-mapping works fine.
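
Before answering "yes" to that dialog, you could check the condition yourself. The following is a minimal sketch, assuming simple ARFF headers in which attribute names contain no spaces or quotes. The function names and the inline sample headers are hypothetical, and this is not how Weka itself performs the mapping.

```python
def read_attributes(lines):
    """Map attribute name -> type string from the lines of an ARFF header."""
    attributes = {}
    for line in lines:
        fields = line.strip().split(None, 2)
        if len(fields) == 3 and fields[0].lower() == '@attribute':
            attributes[fields[1]] = fields[2].strip()
        elif fields and fields[0].lower() == '@data':
            break                   # header ends where the data begins
    return attributes

def training_is_subset(train_lines, test_lines):
    """True if every training attribute is in the test set with the same type."""
    train = read_attributes(train_lines)
    test = read_attributes(test_lines)
    return all(name in test and test[name] == kind
               for name, kind in train.items())

# Hypothetical headers: the test set has one extra attribute (humidity).
train_header = """@relation train
@attribute temperature numeric
@attribute outlook {sunny,overcast,rainy}
@data""".splitlines()
test_header = """@relation test
@attribute temperature numeric
@attribute humidity numeric
@attribute outlook {sunny,overcast,rainy}
@data""".splitlines()

print(training_is_subset(train_header, test_header))
print(training_is_subset(test_header, train_header))
```

To check real files, pass the result of open(path).readlines() for each .arff file in place of the sample headers.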

Assignment 4
    Assignment 4 uses various models, including Bayesian models, to consider the importance of using an additional weather station.
    Assignment 4 is due by 11:59 PM on Wednesday December 4 via make turnitin.
    I will NOT accept solutions to this Assignment 4 after 9 AM on Friday December 6. I need to turn back my solution.

Assignment 5
    Assignment 5 is a cumulative, take-home exam project. It is due by 11:59 PM on Wednesday December 11 via make turnitin.
        I will NOT accept solutions to this Assignment 5 after noon on Thursday December 12.
        Please read the RULES FOR THE FINAL in the handout.

Use command line on Mac to increase memory, do not use the .app at all:
ku135515parson:~ parson$ alias
alias weka='java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar'
alias wekanew='java -server -Xmx4000M -jar /Applications/weka-3-8-2/weka.jar'
ku135515parson:~ parson$ ls -ld /Applications/weka*
drwxr-xr-x@ 16 parson  admin  544 Apr 13  2016 /Applications/weka-3-8-0
drwxr-xr-x@  3 parson  admin  102 Apr 13  2016 /Applications/weka-3-8-0-oracle-jvm.app
drwxr-xr-x@ 16 parson  admin  544 Dec 21  2017 /Applications/weka-3-8-2
drwxr-xr-x@  3 parson  admin  102 Dec 21  2017 /Applications/weka-3-8-2-oracle-jvm.app

ku135515parson:~ parson$ ls -l /Applications/weka-3-8-0
total 85240
-rw-r--r--@  1 parson  admin     35147 Apr 13  2016 COPYING
-rw-r--r--@  1 parson  admin     16171 Apr 13  2016 README
-rw-r--r--@  1 parson  admin   6621937 Apr 13  2016 WekaManual.pdf
drwxr-xr-x@ 57 parson  admin      1938 Apr 13  2016 changelogs
drwxr-xr-x@ 27 parson  admin       918 Apr 13  2016 data
drwxr-xr-x@ 17 parson  admin       578 Apr 13  2016 doc
-rw-r--r--@  1 parson  admin       510 Apr 13  2016 documentation.css
-rw-r--r--@  1 parson  admin      1863 Apr 13  2016 documentation.html
-rw-r--r--@  1 parson  admin     42900 Apr 13  2016 remoteExperimentServer.jar
-rw-r--r--@  1 parson  admin  10759024 Apr 13  2016 weka-src.jar
-rw-r--r--@  1 parson  admin     30414 Apr 13  2016 weka.gif
-rw-r--r--@  1 parson  admin    359270 Apr 13  2016 weka.ico
-rw-r--r--@  1 parson  admin  10997325 Apr 13  2016 weka.jar
-rw-r--r--@  1 parson  admin  14758799 Apr 13  2016 wekaexamples.zip

Slides on Minimum Description Length and Evaluating Numeric Prediction,
    for the week of November 6, in preparation for Assignment 3.
    Read Sections 5.8 and 5.9 in the 3rd Edition textbook, 5.9 and 5.10 in the 4th Edition,
        on Evaluating Numeric Prediction and the Minimum Description Length principle.
        Read textbook sections on linear regression & M5P model trees to reinforce previous lecture material.
        Here is an IBM site overview of interpreting linear regression.
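
For a concrete feel for evaluating numeric prediction, here is a small sketch of three commonly reported error measures: mean absolute error (MAE), root mean squared error (RMSE), and relative absolute error (RAE). The function name and the sample actual/predicted values are invented for illustration. As a simplifying assumption, RAE here is measured against a baseline that always predicts the mean of the supplied actual values.

```python
from math import sqrt

def error_measures(actual, predicted):
    """Return (MAE, RMSE, relative absolute error in percent)."""
    n = float(len(actual))
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    # Baseline: total absolute error of always predicting the mean.
    mean_actual = sum(actual) / n
    baseline = sum(abs(a - mean_actual) for a in actual)
    rae = 100.0 * sum(abs(e) for e in errors) / baseline
    return mae, rmse, rae

# Hypothetical actual and predicted values for four test instances.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
mae, rmse, rae = error_measures(actual, predicted)
print("MAE = %.3f, RMSE = %.3f, RAE = %.1f%%" % (mae, rmse, rae))
```

RMSE weights large errors more heavily than MAE does, and an RAE below 100% means the model beats the predict-the-mean baseline on this data.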

    ZOOM ARCHIVES

       Zoom video (Aug 21 on Mac) on capturing weather data via browser->Excel->CSV file, worth 20% of Assignment 1 (12.5 minutes).
       Zoom video of August 28 class, covering intro to the course. PC-based how-to on capturing weather data starts around 1 hour 30 minutes in.
       Zoom video of September 4 class, covering slides for chapters 1 & 2, and most of assignment 1 handout / Python.
       Zoom video for first half of September 11 class, covering makefile-driven testing and related issues.
       Zoom video for September 18 class, covering the analytics certificate program and half of last year's assignment 2 / Weka preprocessing.
       Zoom video for September 25 class, completing last year's assignment 2 overview and surveying draft data for our assignment 2.
       Zoom video for October 2 class, going over my solution to assn1, assn2 handout, & textbook slides related to assn2, kappa.
       Zoom video for October 9 class, starting last year's assn3 and slides on linear regression, related error measures,
            minimum description length (MDL), and answering some questions during assignment 2 work session near the end.
       Zoom video for October 16 class time posted October 12, covers remainder of fall 2018's assignment 3 started on October 9,
            linear regression, M5P model tree, Random Tree & Random Forest applied to a numeric target attribute,
            10-fold cross validation versus separate test data sets, and other material related to this fall 2019's upcoming assignment 3.
       Zoom video of October 13, view this after October 12's above. I added some techniques to the Oct. 12 lesson that we may use in assn3.
             I also updated this handout from last fall, pages 13-18, to go with this recording.
       Zoom video for October 23 class, my solution to assignment 2, and new assignment 3, some kappa discussion.
       Zoom video for October 30 class, on Naive Bayes and Bayes Net approaches to data analysis.
       Zoom video for November 6 class, wrapping up Bayes, discussing supervised Discretize, and exploring examples of K-means clustering.
       Zoom video for November 13 work session, Q&A on nested ifelse() in AddExpression, and other assignment 4 topics.
       Scripting and Extension Languages as Career Levers for CS&IT Graduates. November 21 Research & Teaching Presentation.
            Here is a Zoom recording for that November 21 talk.
       Zoom video for November 20 class, covering Assignment 3 solution, Assignment 4 handout, instance-based learning & clustering.
       Zoom video for December 4 class, handout of Assignment 5, and a little Q&A on Assignment 4.
       Zoom video for December 11 final exam class, answered a few Assignment 5 questions & went over my Assignment 4 solution.