CPSC523 Fall 2024 Assignment 2

CPSC 523 - Scripting for Data Science, Fall 2024,
Tuesday 6:00-8:50 PM, Old Main 158.
Assignment 2 is due via make turnitin on acad or K120023GEMS by 11:59 PM on Sunday October 13.
Late charge is the standard 10% per day. Assignments must be in before I go over my solution.
I will be on vacation Thursday October 10 through the 15th; there will be no classes or office hours.

This assignment builds on signal processing analysis begun in CSC558 back in spring 2020
and brought to its most recent state of straightforward prediction in CSC458 in spring 2024.
We are using the current dataset in CSC558 as well, with analyses there performed in Weka.
Details on this dataset are in this semester's CSC558 Assignment 2.

The modified dataset used in the current assignment adds significant, non-white-noise ambiguity to the prior
data by adding together multiple copies of a given waveform type with differing frequencies and amplitudes. You
must analyze this more complicated dataset, both for classification and regression. Please take notes
when I go over the predecessor data in class.
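
As a rough illustration of what "adding together multiple copies of a given waveform type" means, here is a minimal sketch that sums sine waves of differing frequencies and amplitudes. The function name composite_sine, the sample rate, and the component values are assumptions made for this example only; the actual generator and waveform types for the dataset are described in the CSC558 handout.

```python
import math

def composite_sine(components, sample_rate, n_samples):
    """Sum several sine waves sample by sample.

    components is a list of (frequency_hz, amplitude) pairs.
    All names and values here are hypothetical, for illustration only.
    """
    samples = []
    for i in range(n_samples):
        t = i / sample_rate  # time of this sample in seconds
        samples.append(sum(amp * math.sin(2.0 * math.pi * freq * t)
                           for freq, amp in components))
    return samples

# One clean sine wave versus the same wave with two extra copies mixed in.
single = composite_sine([(100.0, 1.0)], sample_rate=8000, n_samples=80)
mixed = composite_sine([(100.0, 1.0), (250.0, 0.5), (400.0, 0.25)],
                       sample_rate=8000, n_samples=80)
print(round(single[20], 4), round(mixed[20], 4))
```

The mixed signal no longer peaks where the single sine does, which is the kind of ambiguity a classifier or regressor has to work through.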

Examine Figures 1 through 18 in CSC558 Assn2 as discussed there.

Perform the following steps on K120023GEMS.kutztown.edu:

cd                                    # places you into your login directory
mkdir DataMine              # all of your csc523 projects go into this directory; it may already exist
cd ./DataMine                 # makes DataMine your current working directory
cp ~parson/DataMine/CSC523f24RegressAssn2.problem.zip CSC523f24RegressAssn2.problem.zip
unzip CSC523f24RegressAssn2.problem.zip    # unzips your working copy of the project directory
cd ./CSC523f24RegressAssn2                            # your project working directory

Perform all test execution on K120023GEMS.kutztown.edu to avoid any platform-dependent output differences.
Large input CSV data files in Assignment 2 reside in ~parson/DataMine.
The project makefile uses symbolic links to give you access; notepad++ does not follow symbolic links.
Here are the files of interest in this project directory. There are a few you can ignore.
Make sure to answer the questions in README.txt in your project directory. A missing README.txt incurs a late charge.

The application domain reference for Assignment 2 is here:
https://faculty.kutztown.edu/parson/fall2024/CSC558F24Assn2Handout.html
This is a remake of a current CSC558 project in which we are using Python instead of Weka.

CSC523f24RegressAssn2_generator.py # your work goes here, analyzing correlation coefficients and kappa for regressors & classifiers
CSC523f24RegressAssn2_main.py # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
makelib                              # my library for the makefile
CSC223f24FRQDassn2.csv.gz is the input data file linked from ~parson/DataMine when you enter make test or make links.
CSC523F24Assn2Useful.csv.ref is a reference file showing non-useless data after make test or make links.
CSC523f24Assn2Structured.txt.ref is a reference file showing tree and linear expression after make test or make links.
CSC523f24Assn2Summary.txt.ref is a local file showing model test summary results.
CSC523f24Assn2Trace.txt.ref is a local file showing, for each single non-target attribute paired with the target attribute,
    the kappa for classification, or the CC, MAE, and RMSE for regression. This approach for classification is my experiment that we will go over.
    This standard approach for regression depends on one of your coding steps.
Output file CSC523F24Assn2Useful.csv is generated into and linked from ~parson/tmp/ to save file size costs to students.

To unzip CSC223f24FRQDassn2.csv.gz:
    make clean links
    cp CSC223f24FRQDassn2.csv.gz junk.csv.gz
    gunzip junk.csv.gz
You can now inspect junk.csv.
A subsequent make test, make clean, or make turnitin will remove any junk* file.
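
If you would rather look at the compressed file from Python without creating junk.csv, something like the following sketch may help. It uses only the standard gzip and csv modules. The helper name peek_csv_gz and the stand-in file demo.csv.gz with columns colA and colB are invented for this example; point the helper at the real .csv.gz file in your project directory instead.

```python
import csv
import gzip
import os
import tempfile

def peek_csv_gz(path, n_rows=3):
    """Return (header, first n_rows data rows) from a gzip-compressed CSV."""
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # zip stops after n_rows, so the rest of the file is never read.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows

# Build a tiny stand-in file so this sketch runs anywhere.
# Its column names and values are placeholders, not the assignment data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, mode="wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["colA", "colB"])
    writer.writerow(["1", "2"])
    writer.writerow(["3", "4"])

header, rows = peek_csv_gz(demo_path, n_rows=1)
print(header, rows)
```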

Here is the summary of STUDENT coding steps printed by make student.
grep 'STUDENT *[1-9].*%' CSC523f24RegressAssn2_generator.py | sed -e 's/^ *//'
# STUDENT 1 (1%): Complete the above comment block. Fill in the blanks.
# STUDENT 2 (9%): Complete regression in getKappaCCsMAEsRMSEs.
# STUDENT 3 (10%): Find the MDL DecisionTreeClassifier per tree depth.
# STUDENT 4 (10%): Reduce number of non-target attributes to the essential.
# STUDENT 5 (10%): Select distinct values of attribute tduplication
# STUDENT 6 (10%): Approximate MDL for 'tfreq' by tree leaf size.
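
One way to think about the minimum description length (MDL) steps above is "the simplest model that does about as well as the best one." The sketch below picks the shallowest tree depth whose kappa is within a tolerance of the best kappa seen. The function name pick_mdl_depth, the tolerance parameter, and the trial numbers are all assumptions for illustration; the exact criterion expected in the assignment is the one given in the generator comments and in class.

```python
def pick_mdl_depth(results, tolerance=0.0):
    """Pick the shallowest depth that scores within tolerance of the best.

    results is a list of (depth, kappa) pairs, one per trained tree.
    Hypothetical helper, not part of the handout code.
    """
    best_kappa = max(kappa for _, kappa in results)
    candidates = [(depth, kappa) for depth, kappa in results
                  if kappa >= best_kappa - tolerance]
    return min(candidates, key=lambda pair: pair[0])

# Invented (depth, kappa) values, not results from the assignment data.
trial = [(1, 0.41), (2, 0.63), (3, 0.78), (4, 0.78), (5, 0.77)]
print(pick_mdl_depth(trial))                 # strict: must tie the best kappa
print(pick_mdl_depth(trial, tolerance=0.2))  # looser: accept a simpler tree
```

A tolerance of zero keeps only depths that tie the best kappa; a larger tolerance trades some accuracy for a smaller tree.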

We will examine the handout code and README.txt in class.
This is my first scikit-learn assignment in which training & testing a model with data in
CSC523f24RegressAssn2_main.py returns measures of accuracy and errors back
to your generator, mostly for purposes of tuning model hyperparameters (config parameters).

From CSC523f24RegressAssn2_main.py:

def analyzeClassification(...):
    return (round(kappa,6), matrix, treedepth)
    # matrix is the confusion matrix in row-major order
    # treedepth is numeric for decision trees, else None
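
For reference, here is a minimal sketch of how Cohen's kappa can be derived from a confusion matrix stored as a list of rows. This is the textbook formula written in plain Python, not necessarily how the handout code computes it; the function name cohen_kappa and the example counts are assumptions for illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix given as a list of rows.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total).
    expected = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
                   for i in range(n)) / (total * total)
    return (observed - expected) / (1.0 - expected)

# Invented counts, not results from the assignment data.
example = [[20, 5],
           [10, 15]]
print(round(cohen_kappa(example), 6))
```

Kappa is 1.0 for a perfect classifier and near 0 for one that does no better than chance given the class distribution.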

def analyzeRegression(...):
    return (round(CC,6), round(MAE,6), round(math.sqrt(MSE),6), modeldepth)
    # Correlation Coefficient, Mean Absolute Error, Root Mean Squared Error,
    # depth of tree or number of terms in a linear expression, else None
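
The three regression measures returned above can be sketched in plain Python as follows. This assumes CC means Pearson's correlation coefficient between actual and predicted target values; the function name regression_metrics and the sample values are made up for this example, and the handout code may compute these measures with library calls instead.

```python
import math

def regression_metrics(actual, predicted):
    """Return (CC, MAE, RMSE), each rounded to 6 places.

    CC is Pearson's r here, which is an assumption about the handout.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cc = cov / math.sqrt(var_a * var_p)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return (round(cc, 6), round(mae, 6), round(math.sqrt(mse), 6))

# Invented values: every prediction is off by exactly 0.5.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(metrics)
```

Note that a constant offset leaves CC at 1.0 while MAE and RMSE are nonzero, which is why all three measures are worth inspecting together.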

Helper functions getKappaCCsMAEsRMSEs(...) and normalize are new.
Several helper functions now have a boolean isClassification parameter
to distinguish data preparation for classification versus regression.

Make sure that you answer all questions in README.txt in addition to passing make test before you run make turnitin.