CSC 523 - Scripting for Data Science, Fall 2020, Tu 6-8:50 PM in Old Main 158.

Dr. Dale E. Parson. Class will be live on-line at class time via Zoom. Please read student instructions here.
Mon 6-8:50 PM, Zoom classes & recordings, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC523: See D2L Course CSC523 -> Content -> Overview for the link.
TO WATCH RECORDINGS AFTER 11/6, go here https://kutztown.zoom.us/ and Sign In using KU login.
IF you don’t want to be recorded or are a minor, use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.

Normal Office Hours 11/30-12/3: Monday 1-2, Tuesday 3:30-4:30, Wednesday 12-2, Thursday 3:30-4:30 or by appt.
Final week office hours 12/7-12/10: Monday 1-3, Tuesday & Thursday 11-12:30
See your course page for final exam work-session schedule.

KU Campus Mask policy: Resident students must wear a mask anytime they are outside of their personal room and within a building or with anyone else but their roommate. Commuter students must wear a mask anytime they are on campus within a building or with anyone. The course is 100% via Zoom at class time. I will record & post class videos, but want you there at class time.

PA: The Secretary's Order requires individuals to wear a face covering, in both indoor public places and in the outdoors when they are not able to consistently maintain social distancing from individuals who are not members of their household, such as on a busy sidewalk, waiting in line to enter a place, or near others at any place people are congregating. Whether inside in a public place or outside, and when wearing a face covering or not, everyone should socially distance at least 6 feet apart from others who are not part of your household.
 
Dr. Dale E. Parson, parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 1-2, Tuesday 3:30-4:30, Wednesday 12-2, Thursday 3:30-4:30 or by appt.   


KU offers a 4-course Graduate Certificate in Data Analytics. Talk with me if you want to sign up.
Deloitte has been recruiting Data Scientists (including graduate level) and Data Analysts (undergrad only) in Mechanicsburg, PA

First day handout (syllabus that is specific to this semester).

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else. Thanks! Here is a poll to which you can reply privately on paper or via email.

Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date.

 
RESOURCES & HANDOUTS.

Link to the Fall 2019 CSC458 course.
Here is the Anaconda site from which you can download MOST of the software tools we will use this semester.
    You can also do all of your development on acad. You will have to turn solutions in as source .py files on acad.
     Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.
        I have read reports of adware being bundled with the FileZilla installer. I have used FileZilla for years with no problem.
We will be using Python 3.x. I will use IPython in lecture. You can use any interactive Python environment you like.
    You will turn in projects as stand-alone PROJECT.py scripts, with tests driven by my makefiles or my Python scripts.
How to Think Like a Computer Scientist looks like a good tutorial for Python newbies.
Dr. Schwesinger has posted some additional Python materials for CSC223.
Python regular expressions; a Python regular expression test harness.
    Here is a fall 2017 assignment 1 for CSC458 that serves as an intro to Python module re.
Scikit-learn will be the primary library for several of our projects.
We may need to install libraries from SciPy.org or Anaconda. Each project will outline its library requirements.
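
A regular expression test harness like the one listed above can be quite small. The following is a minimal sketch, not the harness linked from this page: the function name run_regex_tests, the ADDR_PORT pattern, and the sample strings are all hypothetical, chosen only to show the idea of pairing each input with the groups it should capture.

```python
import re

def run_regex_tests(pattern, cases):
    """Return a list of (text, expected, actual, passed) tuples.

    cases is a list of (text, expected) pairs, where expected is the
    tuple of groups a successful search should capture, or None when
    the pattern should not match at all.
    """
    compiled = re.compile(pattern)
    results = []
    for text, expected in cases:
        match = compiled.search(text)
        actual = match.groups() if match else None
        results.append((text, expected, actual, actual == expected))
    return results

# Hypothetical pattern: a dotted-quad address, a colon, and a port number.
ADDR_PORT = r'(\d{1,3}(?:\.\d{1,3}){3}):(\d+)'

# Hypothetical test cases, including one that must NOT match.
CASES = [
    ('src 10.0.0.5:443 ok', ('10.0.0.5', '443')),
    ('no address here', None),
    ('192.168.1.20:8080', ('192.168.1.20', '8080')),
]

for text, expected, actual, passed in run_regex_tests(ADDR_PORT, CASES):
    print('PASS' if passed else 'FAIL', repr(text), '->', actual)
```

Including a case that should fail to match is worth the extra line, since an overly permissive pattern passes every positive test.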

    Using Notepad++, set tabs to be replaced by spaces: go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    and check "Replace by space".
    To convert existing tabs to spaces, use Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta
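
The editor settings above all aim at indenting with spaces rather than tabs. As a quick check before turning in a script, something like the following sketch can report lines whose indentation still contains a tab. The function name find_tab_lines and the sample text are hypothetical; the standard library's tabnanny module (python -m tabnanny PROJECT.py) performs a related check for ambiguous indentation.

```python
def find_tab_lines(source_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            flagged.append(number)
    return flagged

# Hypothetical sample: line 2 is indented with a tab, line 3 with spaces.
sample = "def f():\n\treturn 1\n    x = 2\n"
print(find_tab_lines(sample))
```

To check a real file, read it with open(path).read() and pass the text to the function.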


ASSIGNMENTS

There is a 10% per day late penalty for projects that come in after the due date.

Assignment 1 on Python regular expressions for data extraction due 11:59 PM on September 24 (changed from 17th) via make turnitin.
    Here are my 9/21 class notes on Assignment 1 parts 6 & 8, updated to remove bugs on 9/22.
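
The general shape of regular-expression data extraction followed by aggregation in a dictionary can be sketched as below. This is not the Assignment 1 solution or its data format: the line layout, the PACKET pattern, and the function name aggregate_streams are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical line layout: "<source> -> <destination> len=<bytes>"
PACKET = re.compile(r'(\S+) -> (\S+) len=(\d+)')

def aggregate_streams(lines):
    """Map each (source, destination) pair to its packet count and byte total."""
    streams = defaultdict(lambda: {'packets': 0, 'bytes': 0})
    for line in lines:
        match = PACKET.search(line)
        if match is None:
            continue  # skip lines that do not describe a packet
        src, dst, length = match.groups()
        entry = streams[(src, dst)]
        entry['packets'] += 1
        entry['bytes'] += int(length)
    return dict(streams)

# Hypothetical input lines, including one that the pattern rejects.
sample = [
    '10.0.0.5:51000 -> 10.0.0.9:80 len=120',
    '10.0.0.5:51000 -> 10.0.0.9:80 len=300',
    'malformed line',
    '10.0.0.7:40000 -> 10.0.0.9:22 len=64',
]
for key, totals in aggregate_streams(sample).items():
    print(key, totals)
```

Using a tuple as the dictionary key keeps every packet of one conversation together without needing nested dictionaries.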

Assignment 2 is a Python data preparation & classification problem from a csc558 project in Weka, due 11:59 PM October 22.
    It derives from CSC558 Assignment 1 last semester. Here is the audio signal data overview.     
    My example Python code for a different, example csc458 problem is on acad at ~parson/DataMine/CSC523Example2.zip.
        Here is an ipython interactive trace that I used in working on this example code. Here is another for 9/21 on sklearn.
    A graph on information entropy, which relates to building rules & decision trees.
    Here is my page on interpreting the Kappa statistic needed for Assignment 2.
    Here is a page comparing information entropy to gini index in deciding when/how to branch decision trees.
    A page describing Bayes theorem and related matters. Sklearn's GaussianNB is an implementation of Naive Bayes.
            A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
    Chapter 4 of Weka textbook includes Naive Bayes & instance-based overviews.
    Chapter 12 of Weka textbook introduces ensemble learning.
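
The measures named in the handouts above (information entropy, gini index, the Kappa statistic, and Bayes theorem) are short enough to compute by hand. The sketch below uses only the standard library and made-up labels; it is not the code from CSC523Example2.zip or BayesCards.py, and all function names are hypothetical. In practice scikit-learn's sklearn.metrics.cohen_kappa_score computes kappa for you.

```python
from collections import Counter
from fractions import Fraction
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def cohen_kappa(actual, predicted):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    total = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / total
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    chance = sum(actual_counts[c] * predicted_counts[c]
                 for c in actual_counts) / total ** 2
    return (observed - chance) / (1 - chance)

def bayes(p_b_given_a, p_a, p_b):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# A 50/50 split is the most impure two-class node under both measures.
print(entropy(['a', 'a', 'b', 'b']), gini(['a', 'a', 'b', 'b']))

# Made-up classifier output: 4 of 6 correct, against 50% chance agreement.
actual = ['y', 'y', 'y', 'n', 'n', 'n']
predicted = ['y', 'y', 'n', 'n', 'n', 'y']
print(cohen_kappa(actual, predicted))

# 52-card deck: P(king | face card), with 4 kings and 12 face cards.
print(bayes(Fraction(1), Fraction(4, 52), Fraction(12, 52)))
```

The kappa example shows why raw accuracy can mislead: 67% of predictions are correct, but only a third of the distance between chance and perfect agreement has been covered.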

Assignment 3 is a Python data preparation & numerical correlation problem from a csc558 project in Weka.
    Due by 11:59 PM on Thursday November 12 via make turnitin.
    Please see my email from Oct 27 AM regarding reduction in regressors/test time.
    Answer handout for spring 2020 CSC558 Assignment 2.
    Chapter 4 slides include linear models & model trees
    I adapted slides for Linear Regression and M5P Trees, 10/30/2017.
    Evaluating numeric prediction and Minimum Description Length from Chapter 5.
    My amended slides on Minimum Description Length and Evaluating Numeric Prediction.
    Here is an IBM site overview of interpreting linear regression.
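
As a small worked illustration of linear regression and the error measures used to evaluate numeric prediction, the sketch below fits a one-variable least-squares line and reports mean absolute error, root mean squared error, and the correlation coefficient. The data and function names are made up; the assignment itself uses library regressors rather than hand-written ones.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def error_measures(actual, predicted):
    """Return (mean absolute error, root mean squared error)."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data that is close to, but not exactly, linear.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]
slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]
print('slope', slope, 'intercept', intercept)
print('MAE, RMSE', error_measures(ys, predicted))
print('correlation', correlation(ys, predicted))
```

RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off, which is one reason both are usually reported together.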

Assignment 4 is a new, never before used Time Series project, due via make turnitin by 11:59 PM Thursday December 3.
    Some slides on time series.
    My solution to Spring 2020 csc558 Assignment 3 on time series.
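
Two common first steps in time series work are smoothing with a moving average and turning the series into rows of lagged values that an ordinary regressor can learn from. The sketch below illustrates both on made-up numbers. It is only an assumption about the kind of preparation involved, not the approach required by Assignment 4, and the function names are hypothetical.

```python
def moving_average(values, window):
    """Trailing mean over each full window of consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def lagged_rows(values, lags):
    """Pair each value with the tuple of `lags` values that preceded it."""
    return [(tuple(values[i - lags:i]), values[i])
            for i in range(lags, len(values))]

# Made-up series of four readings.
series = [2, 4, 6, 8]
print(moving_average(series, 2))
for inputs, target in lagged_rows(series, 2):
    print(inputs, '->', target)
```

Each lagged row uses only earlier readings as inputs, so a model trained on these rows never sees the value it is asked to predict.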

Assignment 5 is the final exam project, for which I will answer questions for clarification only in the 11/30 and 12/7 classes.
    It is due by 11:59 PM on Saturday December 12. It is based on the CSC558 Assignment 3 from spring 2020.
    Here is my ad hoc joinARFF example from 12/7.

ZOOM RECORDING ARCHIVES

    August 24 Overview of the course, first day handout, demo & discussion of COVID@KU graphical simulation.
    August 31 class, went over two regular expression assignments from previous semesters and started on Assignment 1.
    September 1 office hours (recorded with attendees' permission), Q&A on finding your regular expressions in assn1.
    September 3 office hours I went over Assn1 Part 6 mini-lecture on using Python dictionaries to aggregate full TCP streams.
    September 14 class, went over prep for Assignment 2 (see above), then had Q&A on Assignment 1 last hour of class.
    September 16 office hours, working through processing the final data attribute for User Datagram Protocol data.
    September 21 class, went over above Weka assignment as it relates to scikit-learn, last hour was Q&A on Assignment 1 parts 6 & 8.
            Here are my class notes on Assignment 1 parts 6 & 8, updated to remove bugs 9/22.
    September 28 class, went over my Hawk Mtn. analysis in sklearn under Assn2 above, also kappa, information entropy, & gini measures.
    October 5 class, went over most of Assignment 2 handout. KU VPN (Virtual Private Network) software is here.
        Once on VPN, you can ssh or putty directly into mcgonagall.kutztown.edu.
    October 12 class, completed look at Assn2, also slides on conditional probabilities, instance-based learning, & ensemble learning.
    October 19 class, inspecting output analysis of assn1, going over predicting numeric attributes & their error measures, work session.
    October 20 debugging session, working the outer "for(ATTR,...):" loop in analyze() through to successful testing.
    October 26 class, went over Assignment 3. Please see my email from Oct 27 AM regarding reduction in regressors/test time.
    November 2 class, finished going over Assignment 3 handout code & results, then work session until 9 PM.
    November 9 class, started discussion of Time Series Analysis, 50 minutes working time for project 3 at the end.
    November 16 class, hand out & go over the next project, Assignment 4.
    November 30 class, hand out & go over Assignment 5, Q&A on Assignment 4.
    December 7 class, went over some pitfalls from previous projects, some Q&A on Assignment 5, especially STUDENT A joinARFF(...).
        Here is my ad hoc joinARFF example from 12/7.