CSC 458 - Data Mining and Predictive Analytics I, Fall 2018

Dr. Dale E. Parson

My office hours for final exams week December 10-14 are:
Monday 12:30 - 2:30 PM (normal hours)
Wednesday 3:30-5:30 PM (half hour later than normal)
Thursday 2:30-3:30 PM (no Tuesday)

Our final exam class is scheduled for Tuesday, December 11, 6-8 PM. I will post the final Assignment 5 and the necessary files by noon on that Tuesday. I will answer questions only in Tuesday’s class between 6 and 8 PM, so come prepared to ask questions. Your make turnitin submission is due by 9 AM on Thursday and no later. Review MDL (Minimum Description Length), the Kappa statistic, and all of the modeling/classification tools you used in Assignments 2, 3, and 4 for this mini-project. There is no Python scripting in project 5.
When comparing two models, we will consider the smaller one to be the MDL model of the two if it loses at least 10% of the description length without losing more than 10% of the modeling accuracy.
    Notes from December 4 class on what topics will / won't be important to the exam.
    You can start working on it as soon as I post it at noon on the 11th.
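The 10% rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: description length is taken to be a single size number for each model (for example a count of tree nodes or rules), accuracy is a fraction between 0 and 1, and both 10% thresholds are read as relative to the larger model. The function name and parameters are hypothetical, and the final project itself requires no scripting.

```python
def smaller_model_is_mdl(large_size, large_accuracy, small_size, small_accuracy):
    """Return True when the smaller model qualifies as the MDL model.

    Assumption: both percentages are measured relative to the larger model.
    """
    # Fraction of description length removed by moving to the smaller model.
    size_reduction = (large_size - small_size) / float(large_size)
    # Fraction of accuracy given up by moving to the smaller model.
    accuracy_loss = (large_accuracy - small_accuracy) / float(large_accuracy)
    return size_reduction >= 0.10 and accuracy_loss <= 0.10


if __name__ == "__main__":
    # Hypothetical numbers: a 100-node tree at 80% correct versus
    # a 60-node tree at 76% correct.
    print(smaller_model_is_mdl(100, 0.80, 60, 0.76))
```

With these made-up numbers the smaller tree is 40% shorter and gives up 5% of the accuracy, so it would count as the MDL model under this reading of the rule.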

See this text file on how & why I plan to use the higher of the mean and median of your project scores to determine your semester grade.
This approach penalizes no one, and it keeps a single assignment (Assignment 1) from causing 25% of the class to suffer unduly.
    Next year's inclusion of CSC223 Advanced Scientific Programming as a prereq for CSC458 will avoid this need in the future.
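As a rough illustration of "the higher of the mean and median", the sketch below computes both for a list of project scores and returns the larger. The scores are invented for the example, the function name is hypothetical, and any weighting or rounding used in the actual grading plan is described in the linked text file, not here.

```python
def higher_of_mean_and_median(scores):
    """Return the larger of the mean and the median of a list of scores."""
    ordered = sorted(scores)
    count = len(ordered)
    mean = sum(ordered) / float(count)
    middle = count // 2
    if count % 2 == 1:
        # Odd count: the median is the middle value.
        median = float(ordered[middle])
    else:
        # Even count: the median is the average of the two middle values.
        median = (ordered[middle - 1] + ordered[middle]) / 2.0
    return max(mean, median)


if __name__ == "__main__":
    # Hypothetical project scores with one low outlier.
    print(higher_of_mean_and_median([40, 90, 92, 95, 88]))
```

In this made-up case the mean is 81.0 and the median is 90.0, so the one low score does not pull the result down.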

Sections 310 and 301 meet Tuesday 6-8:50 PM in Old Main 158.
All sections of students, including the above & also sections 801 & 810, can meet at that time, live on-line via Zoom.
    Please read student instructions here. My instructions for CSC faculty are here.
    USE THIS LINK TO LOG INTO ZOOM AT CLASS TIME. Our Zoom course ID is 399215896.
    I will post links to recorded archives at the bottom of this page within a day after each class.
    Last year's course page is here.
Fall 2018 Office Hours (Old Main 260): Mon 12:30-2:30, Tu 3-4, Wed 3-5, or by appointment

First day handout (syllabus that is specific to this semester).
Textbook: Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Witten et al., ISBN 978-0128042915. You can buy a discounted copy of the 3rd Edition at the KU Book Store -- either edition is fine. I have put a copy of the 3rd edition of the textbook on reserve in Rohrbach Library for the semester. You can go to the front desk & borrow it overnight.

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks! Here is a poll to which you can reply privately on paper or via email.

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700

There is a 10% per day late penalty for projects that come in after the due date. There will be a 10% deduction from a homework assignment for repeated web surfing, web-based chatting or other use of the Internet for activities unrelated to class activities during both lectures and working sessions. During a working session you may leave after completing and turning in all due work; you are encouraged to stay to get additional practice and ask questions. Thank you.

RESOURCES & HANDOUTS. We will use research results published by students & me to discuss various topics.

Here are textbook slides from Kotu’s & Deshpande’s Predictive Analytics and Data Mining: Concepts and Practice using RapidMiner.
    I found this book at the start of the semester and will probably use it the next time I teach the course.
    It is at a more appropriate level for the course, but the slides from all 3 textbooks cited on this page stink.
    However, I will use the slides from this book, with my own additions, since they are generally better.
    I adapted slides for Linear Regression and M5P Trees, 10/30/2017.

There is an excellent on-line video course Predictive Analytics Training with Weka (Introduction) by one of our textbook authors & Weka creators.

Here is our textbook's website. We will be using the Weka tool set, which you can download to your machine from here.
    The PDF Appendix to our textbook is here. It is a 128-page tutorial on using Weka. Here is the Weka Wiki.
    Additional Weka documentation is here.
    I will draw some material from this textbook as well.

Early in the semester, we will begin going over introductory PYTHON.
    We will be using the 2.x version of Python, although assignment 1 is compatible with both 3.x and 2.x.
    Try running python -V to see that you are getting Python 2.6.x or 2.7.x as your default.
        From the acad machine do the following actions in bold:
        Edit a file called .bash_profile in your login directory (create it if needed) and add these 2 lines near the top.
                alias py2="/usr/bin/python"
                alias mcg="ssh mcgonagall"
        Save the file and exit the editor.
        Also make those changes to a file called .bashrc in the login directory.
        ssh mcgonagall    # This logs you into the machine mcgonagall that has some extra utilities installed. All your files should be there.
        Now type this:
                python      # type this to get the basic Python interpreter.
            You should see the interactive python interpreter that we will go over in class.
            If you install python on your own machine, just running python will get you the simpler-to-use interpreter.
            After logging out and back into acad at a later time, just type your alias mcg to log into mcgonagall.
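The version check described above can also be done from inside Python. The following is a minimal sketch that runs under either 2.x or 3.x and reports whether the interpreter is a 2.6 or 2.7 release. The function name is hypothetical and nothing here depends on acad or mcgonagall.

```python
import sys


def is_course_python(version_info=None):
    """Return True for Python 2.6.x or 2.7.x, the versions named above."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor in (6, 7)


if __name__ == "__main__":
    print("Running Python %d.%d.%d" % tuple(sys.version_info[:3]))
    if is_course_python():
        print("This is a 2.6.x or 2.7.x interpreter.")
    else:
        print("This is not a 2.6.x or 2.7.x interpreter.")
```

Saving this to a file and running it with python shows which interpreter the plain python command actually starts.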

    The Python website is at
    There is a good on-line tutorial and reference by Steven F. Lott called Building Skills in Python. There is a PDF copy here.
    The IPython site is here.
We will be using Python for data preparation in assignment 1.
    We have Python installed on acad and mcgonagall, but if you want your own copy:
    You can download Python 2.x or 3.x from here. Use the most recent stable 2.x for this course.
    Documentation including tutorials for the 2.x library is here, for 3.x is here.

Here are my introductory slides on Python. We will explore Python in class, so please attend in person or via RTVC.
    Using Notepad++: Go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    Check Replace by space
    To convert existing tabs to spaces, press Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta
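The editor settings above make the editor insert spaces where it would otherwise insert tabs. Python indentation that mixes tabs and spaces is a common source of confusing errors. As a small, hedged illustration (the function name is hypothetical and this is not part of any assignment), the sketch below reports which lines of a script still have a tab in their leading whitespace.

```python
def lines_with_tab_indent(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(text.splitlines(), 1):
        # The indent is whatever lstrip() would remove from the front.
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = "def f():\n\treturn 1\n    x = 2\n"
    print(lines_with_tab_indent(sample))
```

To check a real file, read its contents into a string first and pass that string to the function.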


The pythex utility for testing Python regular expressions

D. Parson and A. Seidel, "Mining Student Time Management Patterns in Programming Projects," Proceedings of FECS'14: 2014 Intl. Conf. on Frontiers in CS & CE Education, Las Vegas, NV, July 21 - 24, 2014. Here are the slides for the talk and the outline for the follow-up tutorial "Using Weka to Mine Temporal Work Patterns of Programming Students."

D. Parson, L. Bogumil & A. Seidel, "Data Mining Temporal Work Patterns of Programming Student Populations," Proceedings of the 30th Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Edinboro University of PA, Edinboro, PA, April 10-11, 2015. Here are the slides from the talk.

D. Parson, D. E. Hoch & H. Langley, "Timbral Data Sonification from Parallel Attribute Graphs," Proceedings of the 31st Annual Spring Conference of the Pennsylvania Computer and Information Science Educators (PACISE) Kutztown University of PA, Kutztown, PA, April 1-2, 2016. Here are the slides from the talk.

Textbook slides
        Chapter 1 (week 1 - overview)
        Chapter 2 (week 1 - input)
        Chapter 3 (week 3 - output)
        Chapter 4 (rules & trees week 5, linear models & model trees week 9, Bayesian inference week 11, clustering week 12)
            A graph on informational entropy, relates to building rules & decision trees.
            A page describing Bayes theorem and related matters.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/
            BayesNet examples from the textbook.
        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
        Chapter 8 (week 6 - data transformations)

I will draw some material from this textbook as well.
        Chapter 1 (overview)
        Chapter 2 (overview)
        Chapter 3A (data exploration)
        Chapter 3B (data exploration)
        Chapter 4A (information-based learning)
        Chapter 6A (probability-based learning)
        Chapter 6B (probability-based learning)
        Chapter 8A (evaluation)
        Chapter 8B (evaluation)
        Appendix A (descriptive statistics & data visualization)
        Appendix B (introduction to probability)


    Assignment 1 on writing a Python data formatting script, due by 11:59 PM on Friday September 28 via make turnitin.
        My solution is in ~parson/DataMine/ on acad.
        Here are the grading rubrics.

    Assignment 2 on using Weka for classification is due by 11:59 PM on Saturday October 20 via make turnitin.
        Here is my page on interpreting the Kappa statistic needed for this assignment.
        On 10/9/2018 I emailed the class:
            In this instruction for assignment 2:
            4. Partition OxygenMgPerLiter into 10 bins using the Unsupervised -> Attribute -> Discretize filter. You will have to set config parameter ignoreClass to True, since this attribute is the class (i.e., the target attribute). You can decide whether to set useEqualFrequency to True or False. Setting it to True distributes values as evenly as possible across the bins, while leaving it as False distributes across equal ranges of OxygenMgPerLiter levels. YOU MUST MAINTAIN YOUR DATA’S APPROXIMATE INITIAL OxygenMgPerLiter DISTRIBUTION in this step, so make sure to use the correct value for useEqualFrequency. You can always Undo an incorrect guess.
        Make sure you Discretize ONLY OxygenMgPerLiter. A pitfall is to leave the attribute index at first-last, which will discretize all numeric attributes and completely throw off your analysis results, costing you points. So, don't do that.
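To make the difference between the two useEqualFrequency settings concrete, here is a small Python sketch that counts how many values land in each bin under equal-width ranges versus equal-frequency bins. It is only an illustration: the values are invented, the function names are hypothetical, and Weka's Discretize filter may pick its cut points differently. It does not say which setting is correct for the assignment.

```python
def equal_width_counts(values, bins):
    """Count values per bin when each bin spans an equal range of values."""
    low, high = min(values), max(values)
    width = (high - low) / float(bins)
    counts = [0] * bins
    for value in values:
        index = int((value - low) / width) if width > 0 else 0
        # The maximum value would fall just past the last bin, so clamp it.
        counts[min(index, bins - 1)] += 1
    return counts


def equal_frequency_counts(values, bins):
    """Count values per bin when the sorted values are split as evenly as possible."""
    total = len(values)
    counts = []
    for b in range(bins):
        start = b * total // bins
        end = (b + 1) * total // bins
        counts.append(end - start)
    return counts


if __name__ == "__main__":
    # Made-up measurements with one large outlier.
    sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
    print(equal_width_counts(sample, 2))
    print(equal_frequency_counts(sample, 2))
```

With this made-up data, equal-width binning puts nine values in the first bin and one in the second, while equal-frequency binning puts five in each.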
        Midsummer 2012 spikes in dissolved oxygen and pH in Schuylkill River sites at Linfield, Norristown, and center Philly.
        Vegetation growth in rivers: influences upon sediment and nutrient dynamics may offer reasons for a seasonal spike.
                (Search for oxygen, pH, and photosynthesis in the paper. This paper is available via Rohrbach Library)
                macrophyte : a member of the macroscopic plant life especially of a body of water
        Thanks to Robert Elward for tracking down these two papers from the previous paper's references:
            Edwards, R.W. and Owens, M. 1962: The effects of plants on river conditions IV.
                The oxygen balance of a chalk stream. Journal of Ecology 50, 207–20.
            Owens, M. and Edwards, R.W. 1962: The effects of plants on river conditions III.
                Crop studies and estimates of productivity of macrophytes in four streams in southern England. Journal of Ecology 50, 157–62.
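For the Kappa statistic mentioned for Assignment 2, the sketch below computes Cohen's kappa from a confusion matrix using the standard definition (observed agreement minus chance agreement, divided by one minus chance agreement). The matrix is invented for the example and the function name is hypothetical. Weka reports this value directly, so the code is only meant to show where the number comes from.

```python
def kappa_statistic(confusion):
    """Compute Cohen's kappa from a square confusion matrix (list of rows).

    Rows are actual classes and columns are predicted classes.
    """
    size = len(confusion)
    total = float(sum(sum(row) for row in confusion))
    # Observed agreement: the fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of (row share) * (column share).
    expected = 0.0
    for i in range(size):
        row_total = sum(confusion[i])
        column_total = sum(row[i] for row in confusion)
        expected += (row_total / total) * (column_total / total)
    # Note: this divides by zero when chance agreement is exactly 1.
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix.
    print(kappa_statistic([[20, 5], [10, 15]]))
```

Here 70% of the instances are classified correctly, but chance agreement alone would give 50%, so kappa comes out well below the raw accuracy.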

    Assignment 3 on applying linear regression & other linear techniques is due by 11:59 PM on Tuesday November 20 via make turnitin.
        Above handout edited on 10/31 to show a warning panel on page 5.
        Training Set, Good Test Set, Worst Test Set and Monthly Sample Ranges by Site for Assignment 3.
        I plan to take a personal day on November 20.
            I will pre-post a RTVC archive for that day's class. Don't procrastinate & send last-minute email.
        I adapted slides for Linear Regression and M5P Trees, 10/30/2017, posted here 11/13/2018.
        Here is my solution posted 11/26/2018.
        Will Strouse supplied references related to pH->DO, Temp->DO, and TimeOfDay->DO correlations.

    Assignment 4 on comparing Bayesian to techniques used previously is due by 11:59 PM on Friday December 7 via make turnitin.
        My solution is here. We will go over this at start of Tuesday 12/11 class, then go over the final project 5 posted at noon that day.

    Assignment 5, the final exam project, is posted as of noon 12/11, and is due by 9 AM on Thursday December 13 via make turnitin.
        I will not accept late solutions; I need to grade these in a timely manner.
        Assignments coming in any amount after 9 AM on December 13 earn 0%.
        In addition to ~parson/DataMine/ on acad, you can find the handout on the KU Windows network at:
            S:\ComputerScience\Parson\Weka\ OR S:\ComputerScience\Parson\Weka\finalexam458fall2018\

Use the command line on a Mac to increase Weka's memory; do not use the .app at all:
ku135515parson:~ parson$ alias
alias weka='java -server -Xmx4000M -jar /Applications/weka-3-8-0/weka.jar'
alias wekanew='java -server -Xmx4000M -jar /Applications/weka-3-8-2/weka.jar'
ku135515parson:~ parson$ ls -ld /Applications/weka*
drwxr-xr-x@ 16 parson  admin  544 Apr 13  2016 /Applications/weka-3-8-0
drwxr-xr-x@  3 parson  admin  102 Apr 13  2016 /Applications/
drwxr-xr-x@ 16 parson  admin  544 Dec 21  2017 /Applications/weka-3-8-2
drwxr-xr-x@  3 parson  admin  102 Dec 21  2017 /Applications/

ku135515parson:~ parson$ ls -l /Applications/weka-3-8-0
total 85240
-rw-r--r--@  1 parson  admin     35147 Apr 13  2016 COPYING
-rw-r--r--@  1 parson  admin     16171 Apr 13  2016 README
-rw-r--r--@  1 parson  admin   6621937 Apr 13  2016 WekaManual.pdf
drwxr-xr-x@ 57 parson  admin      1938 Apr 13  2016 changelogs
drwxr-xr-x@ 27 parson  admin       918 Apr 13  2016 data
drwxr-xr-x@ 17 parson  admin       578 Apr 13  2016 doc
-rw-r--r--@  1 parson  admin       510 Apr 13  2016 documentation.css
-rw-r--r--@  1 parson  admin      1863 Apr 13  2016 documentation.html
-rw-r--r--@  1 parson  admin     42900 Apr 13  2016 remoteExperimentServer.jar
-rw-r--r--@  1 parson  admin  10759024 Apr 13  2016 weka-src.jar
-rw-r--r--@  1 parson  admin     30414 Apr 13  2016 weka.gif
-rw-r--r--@  1 parson  admin    359270 Apr 13  2016 weka.ico
-rw-r--r--@  1 parson  admin  10997325 Apr 13  2016 weka.jar
-rw-r--r--@  1 parson  admin  14758799 Apr 13  2016

Slides on Minimum Description Length and Evaluating Numeric Prediction,
    for the week of November 6, in preparation for Assignment 3.
    Read Sections 5.8 and 5.9 in the 3rd Edition textbook, 5.9 and 5.10 in the 4th Edition,
        on Evaluating Numeric Prediction and the Minimum Description Length principle.
        Read textbook sections on linear regression & M5P model trees to reinforce previous lecture material.
        Here is an IBM site overview of interpreting linear regression.
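As a companion to the reading on evaluating numeric prediction, here is a minimal sketch of two standard error measures, mean absolute error and root mean squared error, written in plain Python. The actual and predicted values are invented and the function names are hypothetical. The textbook sections listed above cover these and related measures in more detail.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    differences = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(differences) / float(len(differences))


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    squares = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squares) / float(len(squares)))


if __name__ == "__main__":
    # Made-up values: three exact predictions and one miss of 4.
    actual = [1, 2, 3, 4]
    predicted = [1, 2, 3, 8]
    print(mean_absolute_error(actual, predicted))
    print(root_mean_squared_error(actual, predicted))
```

The single large miss raises the root mean squared error to twice the mean absolute error in this example, which is why the two measures can rank models differently.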


    August 28, 2018 Intro to the course, first day handout. Began the Python slides, will resume at slide 7 "Use and, or, not" on September 4.
    September 4, 2018 Finished Python overview, then went over Python regular expressions. This session is critical to assignment 1.
    September 11, 2018 Went over Assignment 1 handout and pseudocode comments. Final hour was lab time.
    September 18, 2018 Went over Textbook slides Chapter 1 (overview), Chapter 2 (input), Chapter 3 (output).
            I added my own examples beyond the slides.
            For next week read above slides on this current page:
            "Mining Student Time Management Patterns in Programming Projects," Slides are here.
            "Data Mining Temporal Work Patterns of Programming Student Populations," Slides are here.
            My solution to last fall's Assignment 2 which is on last year's course page.
                ARFF files are available on acad/mcgonagall as documented
                    or Windows network folder S:\ComputerScience\Parson\Weka.

    September 25, 2018 Went over the above student time management study.   
        Just started examining the 2017 Assignment 2 project.
        Will finish that next week and also hand out & go over our Assignment 2 with Weka.
    October 2, 2018 We went over last year's assignment 2 & our current assignment 2, Weka on USGS data.
    October 16, 2018 We went over issues and a new data pattern about mid-summer exponential plant growth for assignment 2.
    October 23, 2018 We went over linear regression & M5P model trees, data preparation for assignment 3.
    October 30, 2018 We went over Assignment 3 on linear regression & M5P model trees.
    November 6, 2018 We went over Naive Bayes, Bayes Net, & Clustering in anticipation of Assignment 4.
    November 13, 2018 We went over a few illustrations for linear regression, M5P, & k-means clustering, then Assignment 4.
    November 20, 2018 On-line class only, video covers instance-based (lazy) learning, ensemble learning (bagging & boosting), time series.
        Slides from spring 2018 CSC558 on instance-based learning (a.k.a. lazy learning).
        Slides from spring 2018 CSC558 on ensemble learning (bagging, boosting, and RandomForest).
        Slides from spring 2018 CSC558 on time series analysis.
            Time series relationships between chlorophyll-a, dissolved oxygen, and pH in three facultative wastewater stabilization ponds
    November 27, 2018 Review my solution for Assignment 3. Go over concepts of time series analysis using water data with chlorophyll.
    December 4, 2018 Went over mean or median semester grading plan, went over plan for finals week project 5.
    December 11, 2018 Went over my solution to Assignment 4 and took questions for final Assignment 5.