CSC 523 - Scripting for Data Science, Fall 2022, Monday 6-8:50 PM in Old Main 158.

Data Mining Effects of 50 Years of Climate Change at Hawk Mountain Sanctuary
    Thursday November 17 11-11:45 AM in Old Main 158.
     PowerPoint slides here. PDF slides are here. Here is the presentation Zoom video.
     Summer's Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021.

Dr. Dale E. Parson. Class will be live online at class time via Zoom. Please read student instructions here.
Mon 6-8:50 PM, Zoom classes & recordings, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC523: See D2L Course CSC523 -> Content -> Overview for the link.
Student instructions for using Zoom.
If you don’t want to be recorded or are a minor, send me your questions via PRIVATE ZOOM CHAT.
Please fill out & email Dr. Parson this permission to record slip. I will use it to take attendance in week 1.

Dr. Dale E. Parson, parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 2-4, Wednesday 4-6 (Zoom only), Thursday 10-11 or by appt. All available via Zoom.

KU offers a 4-course Graduate Certificate in Data Analytics. Talk with me if you want to sign up.

First day handout (syllabus that is specific to this semester).

I commit to using each student's preferred name and preferred gender pronoun. Feel free to contact me in private if I make mistakes in pronunciation, name, gender, or anything else.
Gender-Based Crimes
Educators must report incidents of gender-based crimes, including sexual assault, sexual harassment, stalking, dating violence, and domestic violence.  If a student discloses such incidents to me during class or in a course assignment, I am not required to report the disclosure, unless the student was a minor at the time the incident occurred.  Regardless of the student’s age, if the incident is disclosed to me outside the classroom setting or a course assignment, I am required by law to report the disclosure, including relevant details, such as the names of those involved in the incident, to Public Safety and Police Services and to Mr. Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal Affairs
(610) 683-4700
pena@kutztown.edu

There is a 10% per day late penalty for projects that come in after the due date.

 
RESOURCES & HANDOUTS.

For students new to using our department's Linux servers:

Please log into acad and run the following commands:

    $ python -V
    Python 3.7.7
    $ ipython -V
    7.14.0

If you see earlier version numbers, edit a file called .bash_profile in your login directory and add the following 2 lines at the top:

    alias python="/usr/local/bin/python3.7"
    alias ipython="/usr/local/bin/ipython3"

Log out, log back in, and check the version numbers again. Let me know if you run into problems.

After that, ssh mcgonagall from acad and check the versions. They should be the same. CSC523 makes heavy use of mcgonagall in future assignments.
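The same version check can also be made from inside a script. This is a minimal sketch, assuming that 3.7 (from the "Python 3.7.7" output above) is the minimum version a project expects; the name version_ok is hypothetical and is not part of any course handout.

```python
import sys

# Assumed minimum, taken from the "Python 3.7.7" output shown above.
REQUIRED = (3, 7)

def version_ok(required=REQUIRED, actual=None):
    """Return True if the interpreter version is at least `required`."""
    if actual is None:
        # Use the (major, minor) pair of the interpreter running this script.
        actual = sys.version_info[:2]
    return tuple(actual) >= tuple(required)

if __name__ == "__main__":
    status = "OK" if version_ok() else "too old; check the alias lines"
    print("Python", sys.version.split()[0], status)
```

Running it on both acad and mcgonagall should report the same version if the aliases are set up correctly.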
*****

D. Parson, 2022, Analysis of Hawk Mountain Sanctuary Observation Data from 1976 through 2021

Scikit-learn will be the primary library for several of our projects.
Here is the Anaconda site from which you can download MOST of the software tools we will use this semester.
    You can also do all of your development on acad. You will have to turn solutions in as source .py files on acad.
     Windows users can download the WinSCP file transfer client in the Computer Science sub-menu below here.
        I have read reports of adware being bundled with the FileZilla installer. I have used FileZilla for years with no problem.
We will be using Python 3.x. I will use IPython in lecture. You can use any interactive Python environment you like.
    You will turn in projects as stand-alone PROJECT.py scripts, with tests driven by my makefiles or my Python scripts.
How to Think Like a Computer Scientist looks like a good tutorial for Python newbies.
Python regular expressions; a Python regular expression test harness.
We may need to install libraries from SciPy.org or Anaconda. Each project will outline its library requirement.
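As a small illustration of the regular expression item above, here is a sketch of a pattern plus a tiny test harness. It is not the linked harness; the pattern, the temperature format, and the names TEMP_RE, parse_temp, and run_harness are all assumptions made up for this example.

```python
import re

# Hypothetical pattern: a signed decimal number followed by a C or F unit.
TEMP_RE = re.compile(r'^\s*([-+]?\d+(?:\.\d+)?)\s*([CF])\s*$')

def parse_temp(text):
    """Return (value, unit) if text looks like '72.5 F', else None."""
    m = TEMP_RE.match(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)

def run_harness(cases):
    """Return the cases whose parse result differs from the expected value."""
    return [(text, expected, parse_temp(text))
            for text, expected in cases
            if parse_temp(text) != expected]

# Each case pairs an input string with the result we expect from parse_temp.
CASES = [("72.5 F", (72.5, "F")), ("-4C", (-4.0, "C")), ("warm", None)]
failures = run_harness(CASES)
print("failures:", failures)
```

Keeping the expected results next to the inputs makes it easy to add a failing case first and then adjust the pattern until the failure list is empty.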

Here are my introductory slides on Python. We will explore Python in class.


Using Notepad++ to indent with spaces instead of tabs: Go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions)
    Check Replace by space    
    To convert existing tabs to spaces, press Edit->Blank Operations->TAB to Space.
    If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
        set ai
        set ts=4
        set sw=4
        set expandtab
        set sta

INSTANCE-BASED (LAZY) LEARNING
    Compilation of Weka slides on Instance Based Learning and Clustering
    https://scikit-learn.org/stable/modules/clustering.html#
    Wissam Malke's thesis "Machine Listening with Very Small Training Datasets"
        Slides for his thesis
        Follow-up white paper "Mapping Data Visualization to Timbral Sonification and Machine Listening"
    Instance-Based Learning Algorithms, a paper from 1991.
    K*: An Instance-based Learner Using an Entropic Distance Measure, a paper from 1995.
    Locally Weighted Naive Bayes, a paper from 2012.
    sklearn.neighbors.KNeighborsClassifier and sklearn.neighbors.KNeighborsRegressor
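The idea behind instance-based (lazy) classification can be sketched in plain Python. This is an illustration of k-nearest-neighbor voting only, not scikit-learn's implementation; the function names and the toy data are assumptions made up for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # "Lazy" learning: no model is built ahead of time; the training
    # instances are stored and all distance work happens at query time.
    nearest = sorted(
        (euclidean(row, query), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups of points.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.0, 5.1)))
```

In practice attributes usually need to be scaled to comparable ranges first, since a distance measure is dominated by whichever attribute has the largest numeric spread.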

ASSIGNMENTS

Assignment 1 due via make turnitin by 11:59 PM on Tuesday September 27.
    (Small add to fix DecisionString spec at top of this linked handout.)
    We will go over my solution ~parson/DataMine/CSC523assn1REfall2022.solution.zip and related code checktemps.zip in the 10/3 class.

Assignment 2 due via make turnitin by 11:59 PM on Thursday October 13.
    There are 2 files to edit. Re-read the handout before turning it in.
    Preceding overview on mechanisms for Assignment 2 Numeric Regression.
    Start at slide 60 Evaluating Numeric Prediction for correlation coefficient and error measures MAE and RMSE.
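The three measures named above can be written out in a few lines of plain Python. This is a sketch using the standard textbook definitions of mean absolute error, root mean squared error, and the Pearson correlation coefficient; the function names and sample values are assumptions, and the assignment handout remains the authority on what to report.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE does."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Undefined (division by zero) if either series is constant.
    return cov / math.sqrt(var_a * var_p)

# Hypothetical values: every prediction is exactly 0.5 too high.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(mae(actual, predicted), rmse(actual, predicted), correlation(actual, predicted))
```

The sample shows why the measures are reported together: a constant offset leaves the correlation coefficient at 1.0 while MAE and RMSE both report the 0.5 error.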

Assignment 3 on numeric data value compression and discretization due by 11:59 PM on Friday October 28 via make turnitin.
    Parson's discussion of the Kappa statistic. Here is the comparison of entropy versus gini (statistical) DecisionTree building as used in the assignment.
    ~parson/DataMine/CSC523Fall2022Classify.demo.zip has the pre-starting point for Assignment 3 code. See also checktemps.zip.
    A graph on informational entropy, relates to building rules & decision trees.
    A page describing Bayes theorem and related matters. A BayesNet example from the textbook.
    A Bayes computer for a 52-card deck is on acad at ~parson/DataMine/BayesCards.py
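The entropy and gini measures compared above can both be computed from the class proportions at a tree node. This is a minimal sketch using the standard definitions; the function names and label lists are assumptions for illustration and are not taken from the assignment code.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of instances belonging to each distinct class label."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def entropy(labels):
    """Informational entropy in bits: -sum(p * log2(p)) over the classes."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels) if p > 0)

def gini(labels):
    """Gini impurity: 1 - sum(p ** 2) over the classes."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

# Hypothetical node contents: an even split, a skewed split, a pure node.
for node in (["a", "a", "b", "b"], ["a", "a", "a", "b"], ["a", "a", "a", "a"]):
    print(node, round(entropy(node), 3), round(gini(node), 3))
```

Both measures are 0 for a pure node and largest for an even split, which is why either can guide the choice of split attribute. In scikit-learn the choice is made through the criterion parameter of DecisionTreeClassifier ("gini" or "entropy").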

        Chapter 5 (5.1 - 5.5 week 8 - evaluation)
        Chapter 8 (week 6 - data transformations)
        Chapter 12 on Ensemble Learning    
            Sklearn classifiers: Dummy, DecisionTree, Naive Bayes GaussianNB, Naive Bayes CategoricalNB, ExtraTree, and LinearSVC, one of the
        Support Vector Machines that infer boundaries between target class groupings.
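For the Kappa statistic mentioned above, here is a sketch of Cohen's kappa as it is commonly defined: the observed agreement between actual and predicted classes, corrected for the agreement expected by chance. It is assumed, not confirmed, that this matches the form used in the linked discussion; the function name and sample labels are made up for this example.

```python
from collections import Counter

def kappa(actual, predicted):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(actual)
    # Observed agreement: fraction of instances classified correctly.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: for each class, the probability that an actual
    # label and an independently drawn predicted label both equal it.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    # Undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels: 3 of 4 predictions are correct.
actual = ["y", "y", "n", "n"]
predicted = ["y", "y", "n", "y"]
print(kappa(actual, predicted))
```

Here 75% of the predictions are correct, but half of that agreement would be expected by chance given the class counts, so kappa is lower than the raw accuracy.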


Assignment 4 on nominal classification due by 11:59 PM on Tuesday November 22 via make turnitin.
    My related research paper from 2006. Here is one related book and then another.

Assignment 5 is a redo of one of Assignments 2, 3, or 4, using new regressors and/or classifiers with new configuration parameters.
    It is due via make turnitin by end of Tuesday December 13. Our "final exam" class on 12/12 will be a work session.

Invoking multiple Weka processes as subprocesses of Python:
~parson/DataMine/coroutine.py
~parson/DataMine/csc458ensemble5sp2021/parallel/csc458ParallelEnsemble5sp2021.py
In ~parson/DataMine/HawkMtn/analysis_scripts, the command grep -l subprocess.Popen *.py lists these scripts:
day_climate2raptor.py
day_date2weather.py
plotcsv.py
year_climate2raptor.py
year_date2weather.py
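The general pattern of starting several subprocesses with subprocess.Popen and then collecting their output can be sketched as follows. This is not the code from the scripts listed above; the helper name run_parallel is hypothetical, and a trivial Python child process stands in for a Weka invocation, whose actual command line is not shown here.

```python
import subprocess
import sys

def run_parallel(commands):
    """Start every command at once, then collect (returncode, output) in order."""
    # Popen returns immediately, so all children run concurrently.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True)
        for cmd in commands
    ]
    results = []
    for proc in procs:
        # communicate() reads the child's output and waits for it to exit.
        out, _ = proc.communicate()
        results.append((proc.returncode, out))
    return results

# Stand-in workers (assumption): each child just prints the square of n.
commands = [[sys.executable, "-c", "print({0} * {0})".format(n)]
            for n in range(1, 4)]
results = run_parallel(commands)
for returncode, output in results:
    print(returncode, output.strip())
```

Launching all of the children before reading from any of them is what lets them run in parallel; reading each one inside the launch loop would serialize the work.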

 
ZOOM RECORDING ARCHIVES
         
August 29 recording: intro to the course, start of Python review, and a first look at Assignment 1, which we will examine in detail on 9/12.

Reagan of CSC458 graciously agreed that we could record our office-hour tour of PuTTY & Notepad++ on 9/7.

September 12 class went over Assignment 1 motivation & goals, docs, & STUDENT requirements in code comments.

September 19 class went over demo predecessor to upcoming Assignment 2 using sklearn regressors & Python generators.

September 26 class went over handouts for Assignment 2. We will have some class work time October 3.
Start at slide 60 Evaluating Numeric Prediction for correlation coefficient & error measures MAE and RMSE.

October 3 class went over a solution to Assn1, checktemps.zip, and CSC523Fall2022Classify.demo.zip above.

October 17 class went over a solution to Assn2, then upcoming Assn3 and related topics.

Firefox & MS Edge can open the following link. I am going to start posting the recording links verbatim as well.
October 24 class went over information entropy (related to minimum description length MDL models), ensemble (a.k.a. meta) and instance-based (a.k.a. lazy) model learning, and some Q&A on Assn3.
https://kutztown.zoom.us/rec/share/AraPVtudc53HFxzNMxPy4K3u05zz-qfDmxeTeL5umdrtjyoXUzpNw6D9_wEAN01c.4k8TGaMESd9V1c--

October 31 class went over assn3 solution, then time-series data analysis concepts (slides) & prep for Assignment 4.
    Precursor to Assignment 4 is at ~parson/DataMine/CSC523Fall2022TimeMIDI.
    https://kutztown.zoom.us/rec/share/_pdR-KBLDBrgrXDYkQPkkUAoG7Hp6Th7t2hLY74jMfR9112G9Ub0_DP3KLqe3XRG.G86rusdtBx1T9Ooh
    Slides: https://faculty.kutztown.edu/parson/fall2020/CSC523TimeSeries.ppt (Links work on Firefox.)

November 7 class went over Assignment 4 code and README requirements.
    https://kutztown.zoom.us/rec/share/rtSaWiQKB9ZICZpfIH_Nu_CMXWH7uQdXKJ3RkYEXkC7nI7k7qMMZ5FenpG-JryjV.PY7OpialkfzrByuT

November 14 class went over Python coroutines & using them to run  parallel Linux processes such as Weka.
    https://kutztown.zoom.us/rec/share/afQW37wKG6dxzywLOr2xyucu02Ns6lZ9VUoSFoGm7A8pKYMgmw866aX-N5J8_J0Z.vcJ-vqP-7OxxPcQk

November 21 class: Hawk Mountain presentation from November 17.
    https://kutztown.zoom.us/rec/share/QwuAhW5-6ouR18YZsmDIuOUOVFeZA5aNgV1WCG8G7-kRFlPAayZzpgtpO5dYQ634.4U-Ch_HwrhOYJYmK

November 28 class went over a solution to Assignment 4, the Assignment 5 handout, and an overview of clustering.

December 5 class: "Organizational Mapping using Machine Learning and Proposal Data"
    CSC570 Presentation by Mike Lanciano. Then:
    3D and Parallel Attribute Graph Visualizations of Assignment 4 Data.
    Parallel Coordinates book by Alfred Inselberg
    Bash Shell Scripting for Data Science
    https://kutztown.zoom.us/rec/share/fIYq9dwH75chBv1UTN835XOjkY81RTH2teitIcfQE4LrEj8af8KNBVfo2rINXm-T.gOQF-j2hhVfLs1OC










