CSC 523 - Scripting for Data Science, Fall 2023, Monday 6-8:50 PM in Old Main 158.

My Thursday December 14 office hour moves to 2:30-3:30 PM so I can attend a thesis presentation.

Use Firefox or another non-Chrome browser for these links. Chrome has problems.
 
    Pitfalls of trying to use off-the-shelf AIs to do your work.

D. Parson, summer 2022, Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021
D. Parson, summer 2023, Analysis of Hawk Mountain Wind Speed to Raptor Count Trends from 1976 through 2021
Data Mining Effects of 50 Years of Climate Change at Hawk Mountain Sanctuary

    WAS Thursday November 17, 2022 11-11:45 AM in Old Main 158.
     PowerPoint slides here. PDF slides are here. Here is the presentation Zoom video.

Dr. Dale E. Parson. Class will be live face-to-face or on-line at class time via Zoom.
Mon 6-8:50 PM, Zoom classes & recordings, https://faculty.kutztown.edu/parson
Class-time Zoom link for CSC523: See D2L Course CSC523 -> Content -> Overview for the link.
Student instructions for using Zoom.
IF you don’t want to be recorded or are a minor, use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.

Dr. Dale E. Parson, parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 3-5 PM, Wed. 3-5 (Zoom only), Thurs. 9:50-10:50 or by appt. All available via Zoom.


KU offers a 4-course Graduate Certificate in Data Analytics. Talk with me if you want to sign up.

First day handout (syllabus that is specific to this semester).

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else.

 
RESOURCES & HANDOUTS.

For students new to using our department's Linux servers:

Please log into acad and run the following commands:

$ python -V
Python 3.7.7
$ ipython -V
7.14.0

If you see earlier version numbers, edit a file called .bash_profile in your login directory and add the following 2 lines at the top:

alias python="/usr/local/bin/python3.7"
alias ipython="/usr/local/bin/ipython3"

Log out, log back in, and check the version numbers again. Let me know if you run into problems.

After that, ssh mcgonagall from acad and check the versions. They should be the same. CSC523 makes heavy use of mcgonagall in future assignments.
*****
I am teaching CSC223 for the first time this semester for anyone wanting to review Python tutorials & Zoom videos.

Scikit-learn will be the primary library for several of our projects.
Here is the Anaconda site from which you can download MOST of the software tools we will use this semester.
    You can also do all of your development on acad. You will have to turn solutions in as source .py files on acad.
     Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.
        I have read reports of adware being bundled with the FileZilla installer. I have used FileZilla for years with no problem.
We will be using Python 3.x. I will use IPython in lecture. You can use any interactive Python environment you like.
    You will turn in projects as stand-alone PROJECT.py scripts, with tests driven by my makefiles or my Python scripts.
How to Think Like a Computer Scientist looks like a good tutorial for Python newbies.
Python regular expressions; a Python regular expression test harness.
We may need to install libraries from SciPy.org or Anaconda. Each project will outline its library requirement.

Here are my introductory slides on Python. We will explore Python in class.


Using Notepad++: Go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    Check Replace by space    
    To convert existing tabs to spaces, press Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta

8/28/2023 A student recommended from experience this free terminal emulator for remote work on acad / mcgonagall:
MobaXterm free Xserver and tabbed SSH client for Windows (mobatek.net)
The free version includes a terminal emulator, SFTP-based upload/download, and SSH tunneling (great for connecting to mcgonagall from acad) but limits the user to 4 saved connections.

INSTANCE-BASED (LAZY) LEARNING
    Compilation of Weka slides on Instance Based Learning and Clustering
    Weka Chapter 4, instance-based learning at slide 90, clustering at slide 102.
        Added 12/2/2023 ~parson/DataMine/sciKitClusterCSC523Fall2023Assn3.02Dec2023.zip
        https://scikit-learn.org/stable/modules/clustering.html
        https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
        https://scikit-learn.org/stable/modules/clustering.html#k-means
        https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn-cluster-agglomerativeclustering
        https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
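Here is a minimal sketch of the three scikit-learn clusterers linked above, run on synthetic data. The two-blob data set, the eps and min_samples values, and the helper cluster_count are assumptions made for illustration; this is not the code in the Assn3 zip file.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic 2-D data (an assumption for illustration): two tight, well-separated blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])

# K-means and agglomerative clustering need the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(data)

# DBSCAN infers the number of clusters from density; label -1 marks noise.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(data)

def cluster_count(labels):
    """Hypothetical helper: count distinct clusters, ignoring DBSCAN noise (-1)."""
    return len(set(labels) - {-1})

for name, labels in [('KMeans', kmeans_labels),
                     ('Agglomerative', agglo_labels),
                     ('DBSCAN', dbscan_labels)]:
    print(name, 'found', cluster_count(labels), 'clusters')
```

The design difference to notice is that KMeans and AgglomerativeClustering take n_clusters as a parameter, while DBSCAN takes a neighborhood radius and a minimum neighbor count instead.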

        Wissam Malke's thesis "Machine Listening with Very Small Training Datasets"
        Slides for his thesis
        Follow-up white paper "Mapping Data Visualization to Timbral Sonification and Machine Listening"
    Instance-Based Learning Algorithms, a paper from 1991.
    K*: An Instance-based Learner Using an Entropic Distance Measure, a paper from 1995.
    Locally Weighted Naive Bayes, a paper from 2012.
    sklearn.neighbors.KNeighborsClassifier and sklearn.neighbors.KNeighborsRegressor
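A small sketch of the two nearest-neighbor estimators named above may help show why this is called lazy learning: fit() mostly stores the training instances, and the work happens at predict() time. The one-attribute training data and the class names 'low' and 'high' are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy training set (an assumption): one non-target attribute per instance.
train_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
train_class = np.array(['low', 'low', 'low', 'high', 'high', 'high'])
train_value = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])

# The classifier votes among the 3 nearest training instances;
# the regressor averages their target values.
classifier = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_class)
regressor = KNeighborsRegressor(n_neighbors=3).fit(train_X, train_value)

test_X = np.array([[1.2], [10.8]])
predicted_class = classifier.predict(test_X)
predicted_value = regressor.predict(test_X)
print(predicted_class, predicted_value)
```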

ASSIGNMENTS
There is a 10% per day late penalty for projects that come in after the due date.

Assignment 1 Preparation for Python regular expression pattern matching and data extraction.
    Here are some related diagrams from TCP/IP Illustrated by Wright & Stevens.
    Here is a spring 2021 assignment that uses Python regular expressions.
    The solution we will go over in class is at /home/kutztown.edu/parson/DataMine/csc458spring2021py1.solution.zip.
        ~parson/ is a Linux alias for /home/kutztown.edu/parson/
        There is a copy of this solution zip file linked here.
   
https://pythex.org/ is a very valuable interactive tool and https://docs.python.org/3/library/re.html is library documentation.
    A student sent this link to this good tutorial on regex. I have back-stepped it to our current Python version 3.7.
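As a warm-up for the pattern matching in Assignment 1, here is a minimal sketch of compiling a pattern with named groups and extracting fields with the re library. The line layout, the name PACKET_PATTERN, and the function extract_fields are hypothetical; they are not the Wireshark format or the code from the handout.

```python
import re

# Hypothetical line layout (an assumption): "<src> -> <dst> <PROTOCOL> <length>"
PACKET_PATTERN = re.compile(
    r'^(?P<src>\d+\.\d+\.\d+\.\d+)\s+->\s+(?P<dst>\d+\.\d+\.\d+\.\d+)\s+'
    r'(?P<proto>[A-Z]+)\s+(?P<length>\d+)$')

def extract_fields(line):
    """Return a dict of the named groups, or None when the line does not match."""
    matchobj = PACKET_PATTERN.match(line.strip())
    if matchobj is None:
        return None
    fields = matchobj.groupdict()
    fields['length'] = int(fields['length'])  # convert the numeric field
    return fields

print(extract_fields('10.0.0.1 -> 10.0.0.2 UDP 74\n'))
print(extract_fields('not a packet line'))
```

You can paste the pattern and a test line into https://pythex.org/ to see which groups match before putting it in a script.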
Assignment 1 is due by 11:59 PM on Thursday September 28 via make turnitin after make test works.
        LINKs to files for anyone working away from acad, described in the handout.
        ^^^ 
DO NOT USE THIS "working away from acad" APPROACH IF YOU CAN USE PUTTY AND NOTEPAD++ ON ACAD & MCGONAGALL.
     Note from 9/13 office hours. When you type make test on the handout code it will stop on an error here:

        diff --strip-trailing-cr  Ethout.csv Ethout.csv.ref > Ethout.dif
        diff --strip-trailing-cr  TCPUDPout.csv TCPUDPout.csv.ref > TCPUDPout.dif
        diff: TCPUDPout.csv: No such file or directory
        make: *** [test] Error 2
     When you have everything except STUDENT 6 & 8 started and working, the next diff will pass:
        diff --strip-trailing-cr  Ethout.csv Ethout.csv.ref > Ethout.dif
        diff --strip-trailing-cr  TCPUDPout.csv TCPUDPout.csv.ref > TCPUDPout.dif
        diff --strip-trailing-cr  TCPStreams.csv TCPStreams.csv.ref > TCPStreams.dif
        make: ***[test] Error some-error-number
You will always see the calls to diff. When you no longer see Error messages, it has passed all tests.
Added 9/18, How to interpret diff files when make test reports an error.
Added 9/18 Setting up putty and notepad++ including notepad++ tab-spacing.


Assignment 2, code is due by end of Monday October 23 via make turnitin on acad or mcgonagall.
See the October 16 addition about over-fitting in this assignment.
ADDED October 4:

A student and I discovered during office hours that:

  1. Assignment 2's makefile will let you run make test on acad, BUT!!!
  2. There are rounding differences between acad's and mcgonagall's Python libraries that cause diffs and error results.

Therefore: MAKE SURE TO DO ALL make test RUNS on mcgonagall.
ADDED 10/6:
To test STUDENT 1 thru 4, comment out the inside of this table.
configTable = [ # list of 12-tuples per CSC523f23Regressassn2_main.py:
#       [modelType, dataName, regressorName, regressor,
#       nontargetDATAtrain, targetDATAtrain,
#       nontargetDATAtest, targetDATAtest,
#       nontargetNames, targetATTR, classifierLabels, accuracyList]
# COMMENT OR TEMPORARILY REMOVE CONTENTS TO BE RESTORED LATER.
]
Once STUDENT 1 thru 4 are done, the first diff should pass, no errors.
+ diff --ignore-trailing-space --strip-trailing-cr parson_CCs.txt parson_CCs.txt.ref

ADDED October 9:
The 3rd entry in the configTable has a mistake. (Thanks to the student who caught this.):
        ['regressor', 'minsmooth', 'LinearRegression', linearRegression,
            minsmoothTrainNontargetData, minsmoothTrainTargetData,
            minsmoothTestNontargetData, minsmoothTestTargetData,
            SmoothHeader[0:-1], RawHeader[-1], None, None],
SHOULD BE:
        ['regressor', 'minsmooth', 'LinearRegression', linearRegression,
            minsmoothTrainNontargetData, minsmoothTrainTargetData,
            minsmoothTestNontargetData, minsmoothTestTargetData,
            SmoothHeader[0:-1], SmoothHeader[-1], None, None],
RawHeader goes with raw and SmoothHeader goes with smooth in these tables.
After you make this fix there will be diffs in correct solutions with your LOGINID and raptor species instead of BW:
$ cat LOGINID_CSC523f23Regressassn2.txt.dif
9c9
< BW_All_smooth =
---
> BW_All =
$ cat LOGINID_CSC523Fall2023TimeRegressOut.txt.dif
11c11
< ATTRIBUTES FOR DATA 3 ['WindSpd_mean_smooth', 'HMtempC_mean_smooth', 'wnd_WNW_NW_smooth'] ->  BW_All_smooth
---
> ATTRIBUTES FOR DATA 3 ['WindSpd_mean_smooth', 'HMtempC_mean_smooth', 'wnd_WNW_NW_smooth'] ->  BW_All
After you fix that third entry in configTable, do the following:
$ make clobber getfiles
That pulls down the .ref files that I updated this morning.

Assignment 2 Preparation
    We will do Python + scikit-learn analysis of Hawk Mountain climate -> raptor counts that I did in Weka earlier this year.
   
Preceding overview on mechanisms for Assignment 2 Numeric Regression.
    Start at slide 60 Evaluating Numeric Prediction for correlation coefficient and error measures MAE and RMSE.
    We will discuss using scikit-learn in preparation for Assignment 2.
    $ ls -d ~parson/Scripting/CSCx23Fall2023DemoRegression*
        /home/kutztown.edu/parson/Scripting/CSCx23Fall2023DemoRegression
        /home/kutztown.edu/parson/Scripting/CSCx23Fall2023DemoRegression.zip
    Ensemble learning used in Assignment 2.
Chapter 12 on Ensemble Learning.
    Instance-based (a.k.a. Lazy) learning used in Assignment 2 (see links above).
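Here is a minimal sketch of the three evaluation measures named above (correlation coefficient, MAE, and RMSE) applied to a scikit-learn LinearRegression. The synthetic data and the 30/10 train/test split are assumptions for illustration; this is not the Assignment 2 handout code or the Hawk Mountain data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic data (an assumption): target = 3 * attribute + 2 plus noise.
rng = np.random.default_rng(0)
nontarget = rng.uniform(0.0, 10.0, size=(40, 1))
target = 3.0 * nontarget[:, 0] + 2.0 + rng.normal(0.0, 0.5, size=40)

# Train on the first 30 instances, test on the last 10.
train_X, test_X = nontarget[:30], nontarget[30:]
train_y, test_y = target[:30], target[30:]

model = LinearRegression().fit(train_X, train_y)
predicted = model.predict(test_X)

# Correlation coefficient between actual and predicted target values.
correlation = np.corrcoef(test_y, predicted)[0, 1]
# Mean absolute error and root mean squared error on the test data.
mae = mean_absolute_error(test_y, predicted)
rmse = np.sqrt(mean_squared_error(test_y, predicted))
print('CC', correlation, 'MAE', mae, 'RMSE', rmse)
```

RMSE squares each error before averaging, so it is never smaller than MAE and it penalizes a few large misses more heavily.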


Assignment 3, code is due by end of Monday November 20 via make turnitin on acad or mcgonagall.
SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8,
    referring to the data visualization figures below "Addendum 11/4/2023".
 

Assignment 3 Preparation
    Assignment 1 data from CSC558 spring 2023 will be the starting point for planning Assignment 3.
    We will use scikit-learn classifiers instead of Weka and Scipy's wav file Fourier frequency spectrum extraction instead of ChucK.
    My notes from 10/22/2023 on classifiers and the confusion matrix, where a row = actual and a column = predicted.
   
A summary of the Kappa Statistic.
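A short sketch may help tie the confusion matrix and kappa together. scikit-learn's confusion_matrix uses the same orientation as my notes (row = actual, column = predicted). The 'cat' and 'dog' labels and the eight instances are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Toy data (an assumption): 8 test instances with actual and predicted classes.
actual    = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat']
predicted = ['cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'cat']
labels = ['cat', 'dog']

# matrix[i][j] counts instances whose actual class is labels[i]
# and whose predicted class is labels[j].
matrix = confusion_matrix(actual, predicted, labels=labels)
kappa = cohen_kappa_score(actual, predicted)

# Kappa by hand: (observed agreement - chance agreement) / (1 - chance agreement).
total = matrix.sum()
observed = matrix.trace() / total
chance = sum(matrix[i, :].sum() * matrix[:, i].sum()
             for i in range(len(labels))) / (total * total)
kappa_by_hand = (observed - chance) / (1.0 - chance)
print(matrix)
print(kappa, kappa_by_hand)
```

Here 6 of 8 predictions are correct (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.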
   
A page describing Bayes theorem and related matters.
    A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
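Here is a minimal sketch of Bayes theorem on a 52-card deck, checked against a direct count. It is my own illustration under stated assumptions, not the contents of BayesCards.py; the names probability, is_king, and is_face are hypothetical.

```python
from fractions import Fraction

RANKS = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
SUITS = ['clubs', 'diamonds', 'hearts', 'spades']
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

def probability(event, given=lambda card: True):
    """P(event | given), computed by counting cards in the deck."""
    population = [card for card in DECK if given(card)]
    hits = sum(1 for card in population if event(card))
    return Fraction(hits, len(population))

def is_king(card):
    return card[0] == 'K'

def is_face(card):
    return card[0] in ('J', 'Q', 'K')

# Bayes theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
bayes_result = (probability(is_face, given=is_king) * probability(is_king)
                / probability(is_face))
# Direct count of kings among face cards, for comparison.
direct_result = probability(is_king, given=is_face)
print(bayes_result, direct_result)
```

Both computations give 1/3: every king is a face card, 4 of 52 cards are kings, and 12 of 52 cards are face cards.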

Assignment 4 on time-series analysis, due by end of Friday December 8 via make turnitin on acad or mcgonagall.

DUE DATE MOVED TO END OF SUNDAY DEC 10 due to a glitch in access to the student file system.
Even if you cannot see your files on acad, a student has informed me that ssh'ing into mcgonagall works OK.
   
Small clarification on STUDENT 4 code added to Assignment spec on 11/27/2023.
 
Assignment 5 is due by 11:59 PM on Saturday December 16 via make turnitin.
    You must test on mcgonagall.
   
Small clarification on STUDENT 5 code added to Assignment spec on 11/27/2023.
 

ZOOM RECORDING ARCHIVES
Use Firefox or another non-Chrome browser for these links. Chrome has problems.
         
August 28 class Went over first day handout, started Assignment 1 prep per above link.
August 30 office hour went over with a student using Pythex to understand regular expressions in Assignment 1 prep.
    Here is the text of related email I sent to the class on August 30 on this topic.
September 11 class Finished Assignment 1 prep & went over Assignment 1 handouts. We will have some work time next week.
    The video is over 4 hours long, but only the first 2 hours 35 minutes have content. Zoom kept recording after class. Nothing is missing.
    The unzipped input data file on acad is at ~parson/DataMine/WireShark29Aug2023_FB_emails_WhyDidWaltKillGusCollapsed.txt
    You can get the other links to big data files by running make test, even when it reports an error.
    For STUDENT 7, my code does the following after matching the UDP (User Datagram Protocol) line, in order to read the next line, which contains the udptype
    such as Domain Name System (query) or Domain Name System (response):
            matchobj = UDP_pattern.match(inline)
            if matchobj:
                udpSymSrc = matchobj.group(N) # get the UDP group fields
                ...
                udpNumDst = matchobj.group(N)
                inline = infile.readline()
                if isinstance(inline,bytes):
                    # convert byte sequence from a gzipped file into string
                    inline = inline.decode()
                inline = inline.strip()  # This is the next line containing the udptype. Increment lineno after calling writerow().
    Skip STUDENT 6 and 8 for now. Doing the other parts will cause this test to pass. That gets you more than halfway done.
        diff --strip-trailing-cr  TCPUDPout.csv TCPUDPout.csv.ref > TCPUDPout.dif
    Finally, the function in csc523sept2023.parts6and8.py does not store all the fields. See TCPStreams.csv on page 5 of the handout.
September 13 Office Hour session around 43 minutes, Assignment 1 parts 6 & 8, and how to debug diff file errors.
September 18 class work session until 7:15 then overview of correlation coefficient and data range reduction.
September 25 class Reviewed relational data tabular format, evaluating numeric prediction, ~parson/Scripting/CSCx23Fall2023DemoRegression.
October 2 class (restored) Went over Parson's solution to Assignment 1, then the Assignment 2 spec & handout code.
    See update at the bottom of the Assignment 2 spec added October 3. See addendum after Assignment 2 link above.
October 10 class Went over Ensemble Learning, Instance-Based Learning, Information Entropy, Assn2 Q&A.

CSC223 October 12 Class went over use of Numpy arrays in Assignment 2 then Pandas 1D Series and Numpy arrays here. Remedial info for CSC523.
October 16 Class [0.0, 1.0] normalization & reasons to use it, stochastic diffs in varying instance order in model training, small data size issues in assn2.
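A minimal sketch of [0.0, 1.0] normalization, assuming column-wise min-max scaling with NumPy; the function name normalize_columns and the sample table are hypothetical. One commonly cited reason to normalize is that distance-based learners such as nearest neighbor otherwise let the attribute with the largest numeric range dominate the distance.

```python
import numpy as np

def normalize_columns(table):
    """Scale each column of a 2-D table into the range [0.0, 1.0]."""
    table = np.asarray(table, dtype=float)
    col_min = table.min(axis=0)
    col_range = table.max(axis=0) - col_min
    # Assumption: a constant column maps to 0.0 instead of dividing by zero.
    col_range[col_range == 0.0] = 1.0
    return (table - col_min) / col_range

# Two attributes with very different ranges end up on the same scale.
raw = [[0, 100], [5, 200], [10, 300]]
print(normalize_columns(raw))
```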

CSC223 October 29 Class went over ~parson/Scripting/notpandas/monthly_raptors/aggregateMonthly.py
        This lecture explores using CSV for lazy versus eager loads of CSV files into memory.
        Available as a zip file here.
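Here is a minimal sketch of the eager versus lazy distinction using the standard csv library. It is not aggregateMonthly.py; the in-memory CSV text and the function names eager_rows and lazy_rows are assumptions for illustration.

```python
import csv
import io

# Small in-memory CSV (an assumption) standing in for a large file on disk.
CSV_TEXT = "year,count\n2019,10\n2020,20\n2021,30\n"

def eager_rows(csvfile):
    """Eager load: read every row into a list before returning."""
    return list(csv.DictReader(csvfile))

def lazy_rows(csvfile):
    """Lazy load: yield one row at a time, so only one row is in memory."""
    for row in csv.DictReader(csvfile):
        yield row

eager_total = sum(int(row['count']) for row in eager_rows(io.StringIO(CSV_TEXT)))
lazy_total = sum(int(row['count']) for row in lazy_rows(io.StringIO(CSV_TEXT)))
print(eager_total, lazy_total)
```

Both give the same total. The eager version allows random access and repeated passes over the rows, while the lazy version keeps memory use flat for very large files but can be traversed only once.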
October 23 Class went over audio signal overview, kappa, and confusion matrices in prep for Assignment 3.
October 30 Class (restored) went over a solution to Assignment 2 and then handout for Assignment 3.
    The second half of the November 6 class will be a work session.
November 6 Class went over Nov. 4/5/6 addendum to Assignment 3 handout to expand on README background info.
    We also clarified kappa and correlation coefficient. The last half was a work session.
November 13 Class went over Bayes Theorem and the need for statistically independent Evidence (non-target attributes),
    the Bayes computer for a 52-card deck on acad at ~parson/DataMine/BayesCards.py,
    visualization of K-Nearest-Neighbor test instance -to- training instances, and started Python generator dataflows.
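A minimal sketch of a Python generator dataflow, assuming a two-stage pipeline in which each stage pulls one value at a time from the stage before it. The stage names read_values and running_mean are hypothetical and are not from the class code.

```python
def read_values(lines):
    """Stage 1: skip blank lines and convert each remaining line to a float."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)

def running_mean(values):
    """Stage 2: yield the mean of all values seen so far."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count

# Nothing is computed until list() starts pulling values through the pipeline.
pipeline = running_mean(read_values(['1\n', '\n', '3\n', '5\n']))
result = list(pipeline)
print(result)
```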
November 20 Class went over the specifications and code for Assignment 4.
    Do not round the return result from calling mean(history) in makeAveragingClosure(Nyears, attributeColumnNumber)'s
        nested closure function averager(row). The comments correctly call for no rounding; rounding will give a diff.
        Here is what my code does: return mean(history)
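For anyone unsure how a closure like this keeps state between calls, here is a minimal sketch. The function names and the unrounded return mean(history) come from the note above; everything else is an assumption, including the use of statistics.mean and the sliding window of the most recent Nyears values, so check the Assignment 4 handout for the actual requirements.

```python
from statistics import mean

def makeAveragingClosure(Nyears, attributeColumnNumber):
    """Return a function that averages one column over recent rows."""
    history = []  # captured by the nested function and kept between calls

    def averager(row):
        # Assumption: keep only the most recent Nyears values of the column.
        history.append(float(row[attributeColumnNumber]))
        if len(history) > Nyears:
            del history[0]
        return mean(history)  # deliberately not rounded

    return averager

# Hypothetical rows of [year, count]; average column 1 over a 2-year window.
averager = makeAveragingClosure(2, 1)
averages = [averager(row) for row in [[1976, 10], [1977, 20], [1978, 40]]]
print(averages)
```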
November 27 Class Went over Assignment 4 README handout and all of Assignment 5, then  work Q&A.
    See Assignment 4 and Assignment 5 handouts for small clarifications made 11/27 in class.
December 4 Class Went over Clustering in slides, Weka, and scikit-learn.
    ~parson/DataMine/sciKitClusterCSC523Fall2023Assn3.02Dec2023.zip on acad has the code.
    December 11 final exam slot will be a work session for Assignment 5.
December 11 Class Went over Assignment 5 requirements, fixing diffs from required changes, Q&A.