CSC 523, Scripting for Data & Analysis, Fall 2024, Assignment 1

CSC 523 - Advanced Scripting for Data Science, Fall 2024, Tuesday 6:00-8:50 PM, Old Main 158.

Assignment 1 Specification. Code is due by the end of Friday, September 20,
via make turnitin on acad or K120023GEMS.

Perform the following steps on K120023GEMS.kutztown.edu after logging into your account via putty or ssh,
    after doing your initial setup for the new Linux server K120023GEMS:

cd                          # places you into your login directory
mkdir DataMine              # all of your csc523 projects go into this directory; it may already exist
cd ./DataMine               # makes DataMine your current working directory
cp ~parson/DataMine/CSC523f24ClassifyAssn1.problem.zip CSC523f24ClassifyAssn1.problem.zip
unzip CSC523f24ClassifyAssn1.problem.zip    # unzips your working copy of the project directory
cd ./CSC523f24ClassifyAssn1                            # your project working directory

Perform all test execution on K120023GEMS.kutztown.edu to avoid any platform-dependent output differences.
All input and output CSV data files in Assignment 1 reside in ~parson/DataMine.
The project makefile uses symbolic links to give you access; notepad++ does not follow symbolic links.
Here are the files of interest in this project directory. There are a few you can ignore.
Make sure to answer the questions in README.txt in your project directory. A missing README.txt incurs a late charge.

The application domain reference for Assignment 1 is here:
https://faculty.kutztown.edu/parson/fall2024/CSC558F24Assn1Handout.html
This is a remake of a current CSC558 project in which we are using Python instead of Weka.

Here is an addendum from 9/23 on why K-Nearest-Neighbor with number-of-neighbors = 5 does poorly.
The cause is an inadequate number of training instances: 5 class values
are spread over only 10 training instances. Once you get past KNN=2, you
would have inaccurate results even in a balanced training distribution
of 2 instances per target class, because at most 2 of the nearest
neighbors can belong to the correct class. This training set is
even worse. See my answer below.
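
Here is a minimal plain-Python sketch of the vote arithmetic behind that note. It is an illustration only, not the assignment's scikit-learn code: the 1-D training values and the class labels a through e are made up, and knn_votes is a hypothetical helper that tallies the class labels among the k nearest training instances by absolute distance.

```python
from collections import Counter

# Hypothetical balanced training set: 5 classes, 2 instances each (10 total).
train = [
    (10.0, "a"), (12.0, "a"),
    (30.0, "b"), (32.0, "b"),
    (50.0, "c"), (52.0, "c"),
    (70.0, "d"), (72.0, "d"),
    (90.0, "e"), (92.0, "e"),
]

def knn_votes(query, k):
    """Return a Counter of class labels among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest)

query = 50.5  # lies between the two class "c" training instances
for k in (1, 2, 5):
    print("k =", k, dict(knn_votes(query, k)))
```

With k = 1 or k = 2 every vote goes to class c. At k = 5 class c can hold at most 2 of the 5 votes, so the remaining 3 votes come from other classes and the prediction depends on how ties are broken.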

CSC523f24ClassifyAssn1_generator.py # your work goes here, analyzing correlation coefficients and kappa for regressors & classifiers
CSC523f24ClassifyAssn1_main.py # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
    Sept 4 add: Please edit the makefile per these instructions.
makelib                            # my library for the makefile
CSC523F24Assn1Handout.csv.gz is the input data file linked from ~parson/DataMine when you enter make test or make links.
To unzip CSC523F24Assn1Handout.csv.gz:
    make clean links
    cp CSC523F24Assn1Handout.csv.gz junk.csv.gz
    gunzip junk.csv.gz
You can now inspect junk.csv.
A subsequent make test, make clean, or make turnitin will remove any junk* file.
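
As an alternative to making a junk copy, Python's standard gzip and csv modules can read a .csv.gz file in place. The sketch below is a hedged illustration: peek_csv_gz is a hypothetical helper, and the demo writes a small throwaway file so it runs anywhere. On the server you would pass CSC523F24Assn1Handout.csv.gz after make links instead.

```python
import csv
import gzip
import os
import tempfile

def peek_csv_gz(path, nrows=3):
    """Return the header and up to nrows data rows of a gzipped CSV file."""
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for _, row in zip(range(nrows), reader)]
    return header, rows

# Throwaway demo file (an assumption for illustration, not the real data file).
sample = (
    "Distribution,Param1,Param2,Count,Mean\n"
    "uniform,0,100,10000,51\n"
    "normal,50,15,10000,51\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, mode="wt", newline="") as handle:
    handle.write(sample)

header, rows = peek_csv_gz(demo_path)
print(header)
print(rows)
```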

Here is a sample of the 5000 input data instances (rows) that we will split into 2500 training and 2500 testing instances.
Links go to the first of 1000 instance illustrations from CSC558 Assignment 1. We are using the same data in CSV format.

Distribution,Param1,Param2,Count,Mean,Hmean,Median,Pstdev,Pvariance,P25,P50,P75,Min,Max
uniform,0,100,10000,51,21,51,29,832,26,51,76,1,100
uniform,0,100,10000,51,21,51,29,825,26,51,76,1,100
uniform,0,100,10000,51,21,52,29,823,27,52,76,1,100
...
normal,50,15,10000,51,46,51,13,176,42,51,60,1,100
normal,50,15,10000,48,43,48,13,176,39,48,57,1,100
normal,50,15,10000,53,50,53,13,162,45,53,62,1,100
...
bimodal,50,15,10000,50,41,50,19,379,32,50,68,1,100
bimodal,50,15,10000,50,40,50,20,391,32,50,67,1,100
bimodal,50,15,10000,50,41,50,19,375,33,50,67,1,100
...
exponential,10,0,10000,13,5,9,12,140,5,9,18,1,100
exponential,10,0,10000,12,5,9,11,122,4,9,16,1,100
exponential,10,0,10000,14,5,10,12,152,5,10,18,1,100
...
revexponential,10,0,10000,90,87,93,11,117,86,93,98,1,100
revexponential,10,0,10000,90,87,94,11,112,86,94,98,1,100
revexponential,10,0,10000,90,87,93,11,126,86,93,98,1,100
...

The job of your models is to predict the target (a.k.a. class) value of the Distribution from the set
{uniform, normal, bimodal, exponential, revexponential} based on correlations with subsets of
the other, non-target attributes.
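
Here is a small sketch of what separating the target from the non-target attributes looks like, using one sample row per class from the listing above. It makes assumptions: split_target is a hypothetical helper, and the subset (Median, Pstdev, P75) is picked only because those attributes appear in the figures, not because the assignment requires that subset.

```python
import csv
import io

sample_csv = """Distribution,Param1,Param2,Count,Mean,Hmean,Median,Pstdev,Pvariance,P25,P50,P75,Min,Max
uniform,0,100,10000,51,21,51,29,832,26,51,76,1,100
normal,50,15,10000,51,46,51,13,176,42,51,60,1,100
bimodal,50,15,10000,50,41,50,19,379,32,50,68,1,100
exponential,10,0,10000,13,5,9,12,140,5,9,18,1,100
revexponential,10,0,10000,90,87,93,11,117,86,93,98,1,100
"""

def split_target(text, target="Distribution", keep=("Median", "Pstdev", "P75")):
    """Return (non-target attribute rows as floats, target labels)."""
    features, labels = [], []
    for record in csv.DictReader(io.StringIO(text)):
        labels.append(record[target])  # the class value to predict
        features.append([float(record[name]) for name in keep])
    return features, labels

X, y = split_target(sample_csv)
for attrs, label in zip(X, y):
    print(label, attrs)
```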

The coding details are in file CSC523f24ClassifyAssn1_generator.py in the handout directory.

$ make student
grep 'STUDENT *[1-9]' CSC523f24ClassifyAssn1_generator.py
# STUDENT 1 (1%): Complete the above comment block. Fill in the blanks.
    # STUDENT 2 (4%) Read and extract CSC523F24Assn1Student.csv.
    # STUDENT 3 (15%) Extract and save CSC523F24Assn1MinAttrs.csv:
    # STUDENT 4 (10%): Select the first 10 instances of minAttrsData:
    # STUDENT 5 (10%): Construct 8 Classifier objects (1.25% each) spec'd below.
    # STUDENT 5a. Construct two DecisionTreeClassifier, one assigned into
    # STUDENT 5b. Construct a GaussianNB() classifier with its defaul parameters,
    # STUDENT 5c. Construct two KNeighborsClassifier objects,

Detailed instructions follow each of those in the file. When coding is complete, run make clean test to test the code;
it prints this message at the end when everything works:
TESTS PASS. Make sure to complete README.txt before make turnitin.

The following error indicates that you are not logged into K120023GEMS.kutztown.edu via ssh for testing:
Traceback (most recent call last):
  File "CSC523f24ClassifyAssn1_main.py", line 31, in <module>
    from sklearn.tree import export_text # These 3 needed for printing trees
ImportError: cannot import name 'export_text' from 'sklearn.tree' (/opt/anaconda3/lib/python3.7/site-packages/sklearn/tree/__init__.py)
make: *** [test] Error 1
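
You can check for this condition before running make test. The sketch below reports whether the Python you are running can import export_text from sklearn.tree, which was added in scikit-learn 0.21; an older install fails exactly as in the traceback. has_export_text is a hypothetical helper name, not part of the handout code.

```python
import importlib

def has_export_text():
    """Return True if the installed scikit-learn provides sklearn.tree.export_text."""
    try:
        tree = importlib.import_module("sklearn.tree")
    except ImportError:
        return False  # scikit-learn itself is not installed for this Python
    return hasattr(tree, "export_text")

print("export_text available:", has_export_text())
```

If this prints False, you are most likely on the wrong machine or using the wrong Python installation.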

If there is a difference between the program's output file and an expected reference file (.ref),
the Linux diff utility reports a difference and leaves the details in a .dif file. We will go over how
to interpret these files.
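
To get a feel for reading difference reports ahead of class, here is a sketch that uses Python's standard difflib module to compare an output text against a reference text. It is an illustration under assumptions: the makefile runs the Linux diff utility, whose default output format differs from the unified format shown here, and dif_report, expected.ref and actual.out are hypothetical names.

```python
import difflib

def dif_report(output_text, reference_text):
    """Return unified-diff lines from the reference to the program output."""
    return list(difflib.unified_diff(
        reference_text.splitlines(),
        output_text.splitlines(),
        fromfile="expected.ref",
        tofile="actual.out",
        lineterm="",
    ))

# Made-up contents, only to show the report shape.
reference = "line one\nline two\n"
output = "line one\nline 2\n"
for line in dif_report(output, reference):
    print(line)
```

Lines starting with - appear only in the reference, lines starting with + appear only in your output, and an empty report means the two texts match.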

You can edit answers into README.txt before or after completing make test, although coding
informs some of the answers. Please take notes in class. I will go over in detail how this
project infrastructure interacts with Python libraries, including scikit-learn, for building and testing models.

When one final make clean test passes and you have answered all questions in README.txt,
run make turnitin on the command line (NOT the turnin script used in some other courses) and hit Enter
when prompted. If make turnitin does not report an error, I have received your project. You will
not receive an email.

Here are some Weka scatter plots pulled from your input data file CSC523F24Assn1Handout.csv.gz.
These relate to several questions in README.txt.

Figure 1: Distribution class values as a function of Median

Figure 2: Distribution class values as a function of Pstdev

Figure 3: Distribution class values as a function of P75 (75th percentile of data)