CPSC 523 - Scripting for Data Science, Fall 2024,
Tuesday 6:00-8:50 PM, Old Main 158. See DEBUGGING AID AT BOTTOM OF THIS PAGE (added 11/4).
Assignment 3 is due via make turnitin on acad or
K120023GEMS by 11:59 PM on Saturday November 23.
Late charge is the standard 10% per day. Assignments must be in before I go over my solution.

NOTE ADDED NOVEMBER 5:

Please leave this code that immediately precedes STUDENT 2 in Assignment 3's generator as I handed it out:

    # ORIGINAL HANDOUT CODE:
    # seedless shuffle of data rows checks student sorting later on.
    InputData[0][1] = shuffle(InputData[0][1])
    InputData[1][1] = shuffle(InputData[0][1]) # LEAVE THIS AS HANDED OUT
    tracefile = open(tracefilename, 'w', newline='')

We will discuss in the 11/5 class.
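
As a small illustration of why a seedless shuffle exercises your sorting: once the rows are shuffled, only a deterministic sort can bring them back to a repeatable order. The sketch below is not the handout code. The row layout, the sort key (movement, channel, tick), and the helper name restore_order are assumptions made up for this example, and it uses Python's in-place random.shuffle, whereas the handout's shuffle returns the shuffled rows.

```python
import random

# Hypothetical rows: (movement, channel, notenum, tick). The column order and
# the sort key are assumptions for illustration, not the handout's layout.
rows = [
    (0, 0, 60, 0),
    (0, 0, 62, 120),
    (0, 1, 55, 0),
    (1, 0, 67, 0),
]

def restore_order(data):
    """Sort rows by (movement, channel, tick) so the order no longer depends on the shuffle."""
    return sorted(data, key=lambda r: (r[0], r[1], r[3]))

shuffled = rows[:]
random.shuffle(shuffled)            # seedless: a different order on every run
restored = restore_order(shuffled)  # the same order on every run
```

If your output differs from run to run, a missing or incomplete sort is a likely place to look.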

This is a time-series data analysis problem using data from the data domain of
this semester's CSC558 Assignment 3.

We will go over the data domain & mechanical steps on October 22.
There is no README in this assignment. That will come later when I have time.
I will explain.

Perform the following steps on K120023GEMS.kutztown.edu:

cd                                          # places you into your login directory
mkdir DataMine                              # all of your csc523 projects go into this directory; it may already exist
cd ./DataMine                               # makes DataMine your current working directory
cp ~parson/DataMine/CSC523Fall2024TimeMIDI.problem.zip CSC523Fall2024TimeMIDI.problem.zip
unzip CSC523Fall2024TimeMIDI.problem.zip    # unzips your working copy of the project directory
cd ./CSC523Fall2024TimeMIDI                 # your project working directory

Perform all test execution on K120023GEMS.kutztown.edu to avoid any platform-dependent output differences.
Here are the files of interest in this project directory. There are a few you can ignore.

CSC523Fall2024TimeMIDI_generator.py  # your time lagging/classification/clustering work goes here
CSC523Fall2024TimeMIDI_main.py  # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
makelib                              # my library for the makefile
fall2024concert_train.csv.gz and fall2024concert_test.csv.gz are input training and testing files.
CSC523assn3_train_fullag.ref.csv.gz and CSC523assn3_test_fullag.ref.csv.gz are the output reference files.
Assignment 2 shows how to unzip and look at .csv.gz file contents.
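
If you would rather inspect a .csv.gz file from Python than from the shell, a minimal sketch using only the standard library follows. The helper name read_csv_gz is made up for this example and is not part of the handout code.

```python
import csv
import gzip

def read_csv_gz(path):
    """Return (header, rows) from a gzip-compressed CSV file."""
    # 'rt' opens the compressed file as text; newline='' is what the csv module expects.
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # first line holds the attribute names
        rows = list(reader)     # every value is read as a string
    return header, rows

# Example use in the project directory (not executed here):
# header, rows = read_csv_gz("fall2024concert_train.csv.gz")
# print(header)
# print(rows[:5])
```

All values come back as strings, so convert numeric attributes yourself before comparing them.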

Running make test, make clean, or make turnitin removes all generated files.

Here is the summary of STUDENT coding steps.
# STUDENT 1 (1%) Complete the above documentation comments. Fill in blanks.
# STUDENT 2 (20%): Clean the incoming data.
# STUDENT 3 (20%): Find the actual dtonic in each (movement, channel) section
# STUDENT 4 (20%): STEP 4 lagging OutputData[0][1] and OutputData[0][1].
# STUDENT 5 (20%): Project away [movement, channel, notenum, tick, dtonic],
# STUDENT 6 (19%) project  attribute 'channel' out of clHdr and clData

Make sure that all make test tests pass before you run make turnitin.

DEBUGGING AID ADDED NOV. 4
Bugs in step STUDENT 2 may not surface as symptoms until downstream in the code.
I put the diff for the output csv files early in the makefile, but the program may blow up before the makefile gets that far.

Therefore, after the last step of STUDENT 2:

# 1e. Call writeCSVfile where the filename is arglist[0] with the
    # string 'CLEANED.' prepended onto the front of arglist[0] and WITH
    # POSSIBLE EXTENSION '.gz' removed if there is one, and
    # the other two arguments to writeCSVfile are InputData[0][0] and
    # InputData[0][1]. Do another call to writeCSVfile with the same
    # 'CLEANED.' prefix using arglist[1] (without any '.gz' -- write
    # an uncompressed .csv file), InputData[1][0], and InputData[1][1].

You can insert these lines of code:

    # Needs os and sys imported. On Linux os.system returns a wait status (exit code 1 arrives as 256), so test only for nonzero.
    trainOK = os.system('PYTHONPATH=. /usr/bin/python3.11 ./diffcsvWgz.py CLEANED.fall2024concert_train.csv CLEANED.fall2024concert_train.ref.csv')
    testOK = os.system('PYTHONPATH=. /usr/bin/python3.11 ./diffcsvWgz.py CLEANED.fall2024concert_test.csv CLEANED.fall2024concert_test.ref.csv')
    if trainOK != 0:
        sys.stderr.write('PROBLEM WITH CLEANED.fall2024concert_train.csv\n')
    if testOK != 0:
        sys.stderr.write('PROBLEM WITH CLEANED.fall2024concert_test.csv\n')
    if trainOK or testOK:
        sys.exit(1)  # sys.exit(trainOK | testOK) can wrap to exit status 0 because only the low 8 bits survive

If those tests fail, watch for output like this:

PYTHONPATH=. /usr/bin/python3.11 diffcsvWgz.py CLEANED.fall2024concert_test.csv CLEANED.fall2024concert_test.ref.csv
        Mismatch at attribute notenum < 60.0 > 79.0 LINE 1
        Mismatch at attribute notenum < 62.0 > 74.0 LINE 3
        Mismatch at attribute notenum < 71.0 > 62.0 LINE 4
        Mismatch at attribute notenum < 71.0 > 79.0 LINE 7
        Mismatch at attribute notenum < 67.0 > 79.0 LINE 8

where '<' is your output and '>' is the reference file.
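
The diff commands above expect the files CLEANED.fall2024concert_train.csv and CLEANED.fall2024concert_test.csv, which step 1e writes. If the diff cannot find them, check how you built the filenames. Here is a minimal sketch of the naming rule from step 1e. The helper name cleaned_name is made up for this example, and it assumes arglist[0] and arglist[1] are bare filenames with no directory part.

```python
def cleaned_name(filename):
    """Prepend 'CLEANED.' and drop a trailing '.gz' if there is one."""
    if filename.endswith(".gz"):
        filename = filename[:-len(".gz")]
    return "CLEANED." + filename

print(cleaned_name("fall2024concert_train.csv.gz"))  # CLEANED.fall2024concert_train.csv
print(cleaned_name("fall2024concert_test.csv"))      # CLEANED.fall2024concert_test.csv

# In the generator, step 1e would then look roughly like this. The argument
# order follows the wording of step 1e and is an assumption about writeCSVfile:
# writeCSVfile(cleaned_name(arglist[0]), InputData[0][0], InputData[0][1])
# writeCSVfile(cleaned_name(arglist[1]), InputData[1][0], InputData[1][1])
```

Stripping '.gz' only when it is the final suffix keeps a name that is already uncompressed unchanged apart from the prefix.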