CSC 523, Scripting for Data & Analysis, Fall 2023, Assignment 2

CSC 523 - Advanced DataMine for Scientific Data Science, Fall 2023, M 6-8:45 PM, Old Main 158.

Assignment 2 Specification, code is due by end of Monday October 23 via make turnitin on acad or mcgonagall.

ADDED October 4:

A student and I discovered during office hours that:

Assignment 2's makefile will let you run make test on acad, BUT!!!
There are rounding differences between acad's and mcgonagall's Python libraries that cause diffs and error results.

Therefore: MAKE SURE TO DO ALL make test RUNS on mcgonagall.

ADDED October 9:
The 3rd entry in the configTable has a mistake. (Thanks to the student who caught this.):
        ['regressor', 'minsmooth', 'LinearRegression', linearRegression,
            minsmoothTrainNontargetData, minsmoothTrainTargetData,
            minsmoothTestNontargetData, minsmoothTestTargetData,
            SmoothHeader[0:-1], RawHeader[-1], None, None],
SHOULD BE:
        ['regressor', 'minsmooth', 'LinearRegression', linearRegression,
            minsmoothTrainNontargetData, minsmoothTrainTargetData,
            minsmoothTestNontargetData, minsmoothTestTargetData,
            SmoothHeader[0:-1], SmoothHeader[-1], None, None],
RawHeader goes with raw and SmoothHeader goes with smooth in these tables.
After you make this fix, even correct solutions will show diffs like the following (with your LOGINID and raptor species in place of BW) until you fetch the updated .ref files:
$ cat LOGINID_CSC523f23Regressassn2.txt.dif
9c9
< BW_All_smooth =
---
> BW_All =
$ cat LOGINID_CSC523Fall2023TimeRegressOut.txt.dif
11c11
< ATTRIBUTES FOR DATA 3 ['WindSpd_mean_smooth', 'HMtempC_mean_smooth', 'wnd_WNW_NW_smooth'] -> BW_All_smooth
---
> ATTRIBUTES FOR DATA 3 ['WindSpd_mean_smooth', 'HMtempC_mean_smooth', 'wnd_WNW_NW_smooth'] -> BW_All
After you fix that third entry in configTable, do the following:
$ make clobber getfiles
That pulls down the .ref files that I updated this morning.
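
The header pairing behind the October 9 fix can be sketched in a few lines. The two header lists below are hypothetical stand-ins built from the attribute names in the diff output above (the raw attribute names are an assumption); the real RawHeader and SmoothHeader come from your own CSV files.

```python
# Hypothetical headers for illustration only; real ones come from your CSV files.
RawHeader = ['WindSpd_mean', 'HMtempC_mean', 'wnd_WNW_NW', 'BW_All']
SmoothHeader = ['WindSpd_mean_smooth', 'HMtempC_mean_smooth',
                'wnd_WNW_NW_smooth', 'BW_All_smooth']

# The target attribute is the last column; everything before it is non-target.
smoothNontargetNames = SmoothHeader[0:-1]
smoothTargetName = SmoothHeader[-1]    # corrected: smooth data, smooth name
mistakenTargetName = RawHeader[-1]     # the original mistake: a raw name

print(smoothNontargetNames, '->', smoothTargetName)
```

With the mistaken entry the reported target would be BW_All rather than BW_All_smooth, which is the difference the .dif files above show.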

October 16: Link added for some email Q&A with student regarding sensitivity to instance order and related over-fitting.

Perform the following steps on acad or mcgonagall after logging into your account via putty or ssh:

cd                                    # places you into your login directory
mkdir DataMine              # all of your csc523 projects go into this directory
cd ./DataMine               # makes DataMine your current working directory, it probably already exists
cp ~parson/DataMine/CSC523f23Regressassn2.problem.zip CSC523f23Regressassn2.problem.zip
unzip CSC523f23Regressassn2.problem.zip    # unzips your working copy of the project directory
cd ./CSC523f23Regressassn2                            # your project working directory
make getfiles                # This downloads your per-student input data CSV and reference files.

Perform all test execution on mcgonagall to avoid any platform-dependent output differences.
All input and output data files in Assignment 2 are small and reside in your project directory.
Here are the files of interest in this project directory. There are a few you can ignore.

CSC523f23Regressassn2_generator.py # your work goes here, analyzing correlation coefficients and linear & nonlinear regressors
CSC523f23Regressassn2_main.py # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
makelib                            # my library for the makefile
getall.sh                           # my Bash script for copying in your CSV input files and .ref files for make getfiles
csc523assn2Rosterfall2023.py # helper for getall.sh that sets up per-student individual projects
diffccs.sh                           # my Bash script to test per-student-project output to per-student reference files
arff2csv.py                         # my utility script for translating Weka ARFF (attribute-relation-file-format) into CSV files for input
arfflib_3_3.py                     # my utility for manipulating typed CSV data as used in the Weka data mining toolset
__pycache__                     # a subdirectory where Python stores compiled byte codes temporarily

REFERENCES

You can use ~parson/Scripting/CSCx23Fall2023DemoRegression/CSCx23Fall2023DemoRegression_generator.py on acad for examples
    of manipulating header rows and multiple rows of data. Each cell in the input CSV is a column's data in its row.
    This file is not part of your assignment. It provides example code for data preparation.
Our analysis parallels my spring and summer 2023 Weka study of correlating climate change to raptor counts at Hawk Mountain's North Lookout.
    STUDENT 1 through 4 requirements duplicate and extend Table 1 in Section 10 of that study.
    STUDENT 5 and 6 relate to per-student assignments for specific raptor species and observation years.

1. Introduction
2. Trend Analysis in Climate to Red-tailed Hawk Counts by Month
3. Trend Analysis in Climate to Sharp-shinned Hawk Counts by Month
4. Trend Analysis in Climate to American Kestrel Counts by Month
5. Trend Analysis in Climate to Broad-wing Hawk Counts by Month
6. Trend Analysis in Climate to Cooper's Hawk Counts by Month
7. Trend Analysis in Climate to Osprey Counts by Month
8. Trend Analysis in Climate to Northern Harrier Counts by Month
9. Trend Analysis in Climate to Northern Goshawk Counts by Month

The assigned table in csc523assn2Rosterfall2023.py shows your unique assignment.
You must not change this file. Here is the assigned table.

assigned = { 'agend932': ('RT', 10, 1990, 'RT_month_10.csv', 'RT_month_smooth_10.csv'), 'aroes474': ('RT', 11, 1985, 'RT_month_11.csv', 'RT_month_smooth_11.csv'), 'ccohe693': ('NG', 10, 1999, 'NG_month_10.csv', 'NG_month_smooth_10.csv'), 'dpate250': ('NH', 11, 1999, 'NH_month_11.csv', 'NH_month_smooth_11.csv'), 'eswan071': ('OS', 9, 2007, 'OS_month_9.csv', 'OS_month_smooth_9.csv'), 'jmcna260': ('AK', 10, 1993, 'AK_month_10.csv', 'AK_month_smooth_10.csv'), 'jrecc716': ('OS', 9, 2000, 'OS_month_9.csv', 'OS_month_smooth_9.csv'), 'larce410': ('CH', 10, 2001, 'CH_month_10.csv', 'CH_month_smooth_10.csv'), 'mling459': ('RT', 11, 1990, 'RT_month_11.csv', 'RT_month_smooth_11.csv'), 'ncheh472': ('OS', 10, 2003, 'OS_month_10.csv', 'OS_month_smooth_10.csv'), 'pagan': ('SS', 9, 1997, 'SS_month_9.csv', 'SS_month_smooth_9.csv'), 'parson': ('BW', 9, 1976, 'BW_month_9.csv', 'BW_month_smooth_9.csv'), 'pbart313': ('AK', 9, 1993, 'AK_month_9.csv', 'AK_month_smooth_9.csv'), 'plin983': ('SS', 10, 1997, 'SS_month_10.csv', 'SS_month_smooth_10.csv'), 'pperr657': ('NH', 10, 1999, 'NH_month_10.csv', 'NH_month_smooth_10.csv'), 'rwalt267': ('CH', 10, 1976, 'CH_month_10.csv', 'CH_month_smooth_10.csv'), 'smann624': ('NH', 9, 1999, 'NH_month_9.csv', 'NH_month_smooth_9.csv'), 'sshah594': ('SS', 9, 2001, 'SS_month_9.csv', 'SS_month_smooth_9.csv'), 'thall326': ('NG', 11, 1999, 'NG_month_11.csv', 'NG_month_smooth_11.csv'), 'vmari085': ('RT', 9, 2008, 'RT_month_9.csv', 'RT_month_smooth_9.csv'), 'wbliz011': ('CH', 10, 1995, 'CH_month_10.csv', 'CH_month_smooth_10.csv')}

The key is the student ID. Columns are raptor species, month for this dataset (it is a fixed month across multiple years),
your data's starting year (ending in 2021), the CSV file with regular climate properties and raptor count per year in that
month, and the CSV file with exponential smoothing on all attribute values in the first CSV file. The opening section of
my previous Weka study explains exponential smoothing as applied here.
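
As a minimal sketch of reading one entry of the assigned table, the code below unpacks the 'parson' entry copied from above. The variable names are hypothetical, and the instance count assumes one row per year from the starting year through 2021.

```python
# One entry copied from the assigned table above; the full table lives in
# csc523assn2Rosterfall2023.py, which you must not change.
assigned = {
    'parson': ('BW', 9, 1976, 'BW_month_9.csv', 'BW_month_smooth_9.csv'),
}

loginID = 'parson'   # substitute your own login ID
species, month, startYear, rawCSV, smoothCSV = assigned[loginID]
endYear = 2021       # per the text, every dataset ends in 2021
numYears = endYear - startYear + 1   # assumes one instance per year, inclusive

print(species, month, startYear, rawCSV, smoothCSV, numYears)
```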

A smoothed value in these graphs is SmoothedValue_timeT = (alpha X NormalizedValue_timeT) + ((1.0 - alpha) X NormalizedValue_timeT-1), with fractional multiplier alpha in the range [0.0, 1.0]. The graphs in this discussion use alpha = 0.1 to smooth the peaks and valleys in the normalized values in order to show long-term trends and slopes. [6]
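
The formula above can be sketched as follows. This follows the wording quoted above literally; textbook exponential smoothing usually carries forward the previous smoothed value rather than the previous normalized value, so treat this as an illustration of the quoted formula only. The function name and the handling of the first value are assumptions.

```python
def smoothAsWritten(normalized, alpha=0.1):
    """Apply the quoted formula to a list of normalized values."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError('alpha must be in the range [0.0, 1.0]')
    smoothed = []
    for t, value in enumerate(normalized):
        if t == 0:
            # Assumption: the first value has no predecessor, so keep it as-is.
            smoothed.append(value)
        else:
            # alpha weights the current value, (1 - alpha) the previous one.
            smoothed.append(alpha * value + (1.0 - alpha) * normalized[t - 1])
    return smoothed

demo = smoothAsWritten([1.0, 0.0, 1.0], alpha=0.1)
print(demo)
```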

As usual, make clean test tests your code and make turnitin turns it in to me by the due date.
There is the usual 10% per-day late penalty after the deadline.

We will go over this Monday October 2 and at least half of Tuesday October 10th's class will be work time.
KU has a break day October 9, and October 10 follows Monday's schedule, including my 3-5 PM office hours.

An addendum from an email reply to a student posted here October 2:

Weka has three approaches to training then testing.

1. Train and test on the same, full dataset. The problem here is overfitting models to training data.
2. Train on one set of instances and test on another. Scikit-learn's model training & testing functions are oriented this way.
3. N-fold cross-validation, where Weka defaults N to 10 but lets the user change it. 10-fold cross-validation trains on 9/10ths of randomly selected instances, tests on the remaining 10th, then repeats the process for a total of 10 times with a different 9/10ths as training instances each time, i.e., "systematic and random sampling (without replacement)". The sklearn docs show ways to do N-fold cross-validation, but it is not integrated into the train & test methods very well and appears cumbersome to use. I may explore it in an assignment, but it is not a great fit to sklearn.
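
The partitioning behind N-fold cross-validation can be sketched in plain Python. This shows only how instances are divided into folds; it is not Weka's or scikit-learn's implementation, and nFoldSplits and its parameters are hypothetical names.

```python
import random

def nFoldSplits(numInstances, n=10, seed=42):
    """Yield (trainIndices, testIndices) pairs for n-fold cross-validation."""
    indices = list(range(numInstances))
    random.Random(seed).shuffle(indices)        # random order, no replacement
    folds = [indices[i::n] for i in range(n)]   # n nearly equal-sized folds
    for k in range(n):
        testIndices = folds[k]                  # one fold is held out for testing
        trainIndices = [ix for j, fold in enumerate(folds) if j != k
                        for ix in fold]         # the other n-1 folds train
        yield trainIndices, testIndices

splits = list(nFoldSplits(46, n=10))
for trainIndices, testIndices in splits:
    print(len(trainIndices), len(testIndices))
```

Every instance lands in exactly one test fold, so each instance is tested once and used for training in the other folds.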

A problem for assignment 2 is that we have only 46 instances, fewer if we focus on the years of declining raptor counts. The assignment will look at ways to ameliorate that limitation in this (much-reduced) dataset size. In a sense we are essentially back to #1 above with the likelihood of over-fitting. But it still produces useful models in the sense that they show recent trends in correlation between climate factors and raptor count declines.

ADDED TUESDAY OCTOBER 3
Here is the code I used to partition minraw into training & testing data some time after minraw = shuffle(minraw, random_state=42)

    minrawTrain = minraw[0:len(minraw)//2]
    minrawTrainNontargetData = [row[0:-1] for row in minrawTrain]
    minrawTrainTargetData = [row[-1] for row in minrawTrain]
    minrawTest = minraw[len(minraw)//2:]
    minrawTestNontargetData = [row[0:-1] for row in minrawTest]
    minrawTestTargetData = [row[-1] for row in minrawTest]

This code assumes the target attribute is in the last column, which is indexed by row[-1].
The minraw dataset was constructed with the target attribute in the last column.
The expression [row[0:-1] for row in minrawTrain] gets you a dataset of only non-target attributes.
minraw[0:len(minraw)//2] uses the first half of instances as training data for building a regressor model.
minraw[len(minraw)//2:] uses the last half of instances as test data for testing model accuracy.

You can construct target and nontarget training and testing data for minsmooth, maxraw, and maxsmooth similarly.
Stick with this naming convention. Look at the configTable at the bottom of your source file to confirm names.
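
As a sketch of the slicing pattern above, the same four lists can be produced by one helper. splitTrainTest and the toy rows are hypothetical and not part of the handout code. Your own variables must still use the names the configTable expects.

```python
def splitTrainTest(dataset):
    """Return (trainNontarget, trainTarget, testNontarget, testTarget)."""
    half = len(dataset) // 2
    train, test = dataset[0:half], dataset[half:]
    return ([row[0:-1] for row in train], [row[-1] for row in train],
            [row[0:-1] for row in test], [row[-1] for row in test])

# Toy rows standing in for an already-shuffled dataset: two non-target
# attributes followed by the target in the last column.
toy = [[0.1, 0.2, 5.0], [0.3, 0.4, 6.0], [0.5, 0.6, 7.0],
       [0.7, 0.8, 8.0], [0.9, 1.0, 9.0]]
(toyTrainNontargetData, toyTrainTargetData,
 toyTestNontargetData, toyTestTargetData) = splitTrainTest(toy)
print(toyTrainNontargetData, toyTrainTargetData)
print(toyTestNontargetData, toyTestTargetData)
```

With an odd number of rows, integer division puts the extra row in the test half, as in the original slicing.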