CPSC 523 - Scripting for Data Science, Fall 2024,
Tuesday 6:00-8:50 PM, Old Main 158.
Assignment 2 is due via make turnitin on acad or K120023GEMS by
11:59 PM on Sunday, October 13.
The late charge is the standard 10% per day. Assignments must be
submitted before I go over my solution.
I will be on vacation from Thursday, October 10 through October 15;
there will be no classes or office hours.
This assignment builds on the signal processing analysis begun in
CSC558 back in spring 2020 and brought to its most recent state of
straightforward prediction in CSC458 in spring 2024.
We are using the current dataset in CSC558 as well, with the
analyses there performed in Weka. Details on this dataset are in
this semester's CSC558 Assignment 2.
The modified dataset used in the current assignment adds
significant, non-white-noise ambiguity to the prior data by adding
together multiple copies of a given waveform type with differing
frequencies and amplitudes. You must analyze this more complicated
dataset for both classification and regression. Please take notes
when I go over the predecessor data in class.
Examine Figures 1 through 18 in CSC558 Assignment 2, as discussed
there.
Perform the following steps on K120023GEMS.kutztown.edu:

cd                  # places you into your login directory
mkdir DataMine      # all of your csc523 projects go into this directory
cd ./DataMine       # makes DataMine your current working directory; it may already exist
cp ~parson/DataMine/CSC523f24RegressAssn2.problem.zip CSC523f24RegressAssn2.problem.zip
unzip CSC523f24RegressAssn2.problem.zip   # unzips your working copy of the project directory
cd ./CSC523f24RegressAssn2                # your project working directory
Perform all test execution on K120023GEMS.kutztown.edu to avoid
any platform-dependent output differences.
Large input CSV data files in Assignment 2 reside in
~parson/DataMine.
The project makefile uses symbolic links to give you access;
notepad++ does not follow symbolic links.
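Since GUI editors such as notepad++ do not follow symbolic links,
you can inspect a linked file from the command line instead. A
minimal, self-contained sketch using a throwaway file (on acad the
actual links come from make links; the /tmp paths here are just for
demonstration):

```shell
# Demonstration with a throwaway file and link in /tmp.
echo "a,b,c" > /tmp/demo_target.csv
ln -sf /tmp/demo_target.csv /tmp/demo_link.csv
ls -l /tmp/demo_link.csv     # shows where the link points
readlink /tmp/demo_link.csv  # prints the target path
head -1 /tmp/demo_link.csv   # command-line tools follow the link
```

head, less, and grep all dereference symbolic links transparently,
so you can examine the large linked CSV files without copying them.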
Here are the files of interest in this project directory. There are
a few you can ignore. Make sure to answer the questions in
README.txt in your project directory. A missing README.txt incurs a
late charge.
The application domain reference for Assignment 2 is here:
https://faculty.kutztown.edu/parson/fall2024/CSC558F24Assn2Handout.html
This is a remake of a current CSC558 project in which we are using
Python instead of Weka.
CSC523f24RegressAssn2_generator.py  # your work goes here, analyzing
    correlation coefficients and kappa for regressors & classifiers
CSC523f24RegressAssn2_main.py       # Parson's handout code for
    building & testing models that your generator above provides
makefile    # the Linux make utility uses this script to direct
    testing & data viz graphing actions
makelib     # my library for the makefile
CSC223f24FRQDassn2.csv.gz is the input data file linked from
~parson/DataMine when you enter make test or make links.
CSC523F24Assn2Useful.csv.ref is a reference file showing non-useless
data after make test or make links.
CSC523f24Assn2Structured.txt.ref is a reference file showing tree
and linear expressions after make test or make links.
CSC523f24Assn2Summary.txt.ref is a local file showing model test
summary results.
CSC523f24Assn2Trace.txt.ref is a local file showing
single nontarget-to-target attribute kappas for classification, or
CCs, MAEs, and RMSEs for regression. This approach for
classification is my experiment that we will go over. The standard
approach for regression depends on one of your coding steps.
Output file CSC523F24Assn2Useful.csv is generated into and linked
from ~parson/tmp/ to save file-size costs to students.
To unzip CSC223f24FRQDassn2.csv.gz:

make clean links
cp CSC223f24FRQDassn2.csv.gz junk.csv.gz
gunzip junk.csv.gz

You can now inspect junk.csv. A subsequent make test, make clean, or
make turnitin will remove any junk* file.
Here is the summary of STUDENT coding steps printed by make student:

grep 'STUDENT *[1-9].*%' CSC523f24RegressAssn2_generator.py | sed -e 's/^ *//'
# STUDENT 1 (1%): Complete the above comment block. Fill in the blanks.
# STUDENT 2 (9%): Complete regression in getKappaCCsMAEsRMSEs.
# STUDENT 3 (10%): Find the MDL DecisionTreeClassifier per tree depth.
# STUDENT 4 (10%): Reduce the number of non-target attributes to the essential.
# STUDENT 5 (10%): Select distinct values of attribute tduplication.
# STUDENT 6 (10%): Approximate MDL for 'tfreq' by tree leaf size.
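For the decision-tree depth search, here is a hedged sketch of the
general technique on synthetic data: train a DecisionTreeClassifier
at each max_depth and keep the shallowest tree whose kappa is within
a small tolerance of the best seen, in the minimum-description-length
spirit. The data, depth range, and tolerance here are illustrative
assumptions, not the handout's required solution.

```python
# Sketch: smallest tree depth preserving near-peak kappa (toy data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

best = None  # (kappa, depth), shallowest depth wins ties
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(Xtrain, ytrain)
    kappa = cohen_kappa_score(ytest, model.predict(Xtest))
    # Update only on a clear improvement, so the shallowest
    # near-peak tree is retained (MDL-style preference).
    if best is None or kappa > best[0] + 0.001:
        best = (round(kappa, 6), depth)
print(best)
```

Because the update requires a strict improvement beyond the
tolerance, a deeper tree that merely matches the best kappa never
replaces a shallower one.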
We will examine the handout code and README.txt in class.
This is my first scikit-learn assignment in which training & testing
a model with data in CSC523f24RegressAssn2_main.py returns measures
of accuracy and error back to your generator, mostly for purposes of
tuning model hyperparameters (config parameters).
From CSC523f24RegressAssn2_main.py:

def analyzeClassification(...):
    return (round(kappa,6), matrix, treedepth)
    # matrix is the confusion matrix in row-major order
    # treedepth is numeric for decision trees, else None
def analyzeRegression(...):
    return (round(CC,6), round(MAE,6), round(math.sqrt(MSE),6), modeldepth)
    # Correlation Coefficient, Mean Absolute Error, Root Mean Squared Error,
    # depth of tree or number of terms in a linear expression, else None
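The regression measures above can be computed with standard
scikit-learn metrics. A minimal sketch on toy arrays (the values
here are made up for illustration and are not from the assignment's
dataset or the handout code):

```python
# Sketch: CC, MAE, and RMSE for a toy set of predictions.
import math
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

ytest = np.array([1.0, 2.0, 3.0, 4.0])   # actual target values
ypred = np.array([1.1, 1.9, 3.2, 3.8])   # model predictions

CC = np.corrcoef(ytest, ypred)[0, 1]              # correlation coefficient
MAE = mean_absolute_error(ytest, ypred)           # mean absolute error
RMSE = math.sqrt(mean_squared_error(ytest, ypred))  # root mean squared error
print(round(CC, 6), round(MAE, 6), round(RMSE, 6))
```

Note that RMSE is derived by taking the square root of MSE, matching
the round(math.sqrt(MSE),6) term in analyzeRegression's return value.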
Helper functions getKappaCCsMAEsRMSEs(...) and normalize are new.
Several helper functions now have a boolean isClassification
parameter to distinguish data preparation for classification versus
regression.
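For intuition about what a normalize helper typically does, here is
a hedged sketch of min-max normalization, which rescales each
numeric column to [0, 1]. This is an assumption about the general
technique; the handout's actual normalize function may differ.

```python
# Sketch: min-max scale each column of equal-length numeric rows.
def normalize(rows):
    '''Rescale every column to [0, 1]; constant columns become 0.0.'''
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]   # per-column minimum
    hi = [max(c) for c in cols]   # per-column maximum
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

print(normalize([[0, 10], [5, 20], [10, 30]]))
# -> [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Normalization like this matters when attributes with large numeric
ranges would otherwise dominate distance-based or gradient-based
models.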
Make sure that you answer all questions in README.txt, in addition
to passing make test, before you run make turnitin.