CSC 523 - Scripting for Data Science, Fall 2022, Assignment 4 (official).

Assignment 4 due by 11:59 PM on Tuesday November 22 via "make turnitin". You must test on mcgonagall.


After 11/7 handout class:
As I mentioned in class, in CSC523Fall2022TimeMIDI_generator.py, lines 198 and 207 in the original handout pseudo-code:

198     #   APPEND the notenum into list TmovementChannelToTonic[key][0].
207     #   APPEND the notenum into list TmovementChannelToTonic[key][0].

SHOULD SAY:

198     #   APPEND the (notenum%12) into list TmovementChannelToTonic[key][0].
207     #   APPEND the (notenum%12) into list TmovementChannelToTonic[key][0].

That modulo 12 operator has to be in there to discard the octave (how far up the piano keyboard), which is irrelevant to our analysis.
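
A minimal sketch of what the modulo does, assuming only that each octave spans 12 MIDI note numbers. The helper name pitch_class is hypothetical and is not part of the handout code:

```python
def pitch_class(notenum):
    """Return the note's position within its octave (0 through 11)."""
    return notenum % 12

# MIDI notes 55, 67, and 79 are the same pitch in three different octaves.
same_pitch = [pitch_class(n) for n in (55, 67, 79)]
print(same_pitch)   # [7, 7, 7]
```

Without the modulo, those three notes would be counted as three different values when looking for the most frequent note.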

Also if you cannot find CSC523Fall2022TimeMIDIOut.sorted.txt.ref, it is here:

~parson/DataMine/CSC523Fall2022TimeMIDI/CSC523Fall2022TimeMIDIOut.sorted.txt.ref

I fixed both problems & re-zipped the handout .zip file, but if you got an earlier version, you'll need this info.

1. To get the assignment on mcgonagall (ssh mcgonagall or ssh kupapcsit01 from acad):
    mkdir ~/DataMine   #  (This is in your home directory; it may already exist if you took one of my data science courses.)
    cd ~/DataMine
    cp ~parson/DataMine/CSC523Fall2022TimeMIDI.problem.zip CSC523Fall2022TimeMIDI.problem.zip
    unzip CSC523Fall2022TimeMIDI.problem.zip
    cd ./CSC523Fall2022TimeMIDI
    make clean test
   
# This fails on handout code due to missing code.

    # ADDED 11/5 if you get an error ModuleNotFoundError: No module named 'xgboost', just comment out the import of xgboost.
    #    Thanks to Bob Elward for the catch. I updated the handout code late 11/4.

These are the attributes we will use in this analysis in fall2022concert_train.arff.gz and fall2022concert_test.arff.gz.
We discussed time-series data analysis and this dataset on October 31.

@attribute movement numeric        # A movement of a musical piece, conceptually a song, numbered 0 through 3.
@attribute channel numeric            # A MIDI channel, conceptually a musician playing an instrument, numbered 0 through 3.
@attribute command {noteon, noteoff}    # Whether the musician has just played or stopped playing a note.
                                                                 # handout code filters out noteoff because it adds no information to scale analysis,
                                                                 # then removes attributes command and velocity. It must wait to remove attribute
                                                                # command until it has removed noteoff instances so it can find them.
@attribute notenum numeric            # The note 0 through 127 being played. Think of a piano keyboard.
@attribute velocity numeric              # How hard the note is played, irrelevant for scale analysis.
@attribute tick numeric                    # The time within the (movement, channel) sequence, needed for note lagging
@attribute ttonic numeric                 # The so-called "do note" or tonic or root, which is the key pitch of the scale.
                                                         # Initially ttonic is tagged data by the score generator.
                                                         # Handout code derives an empirical ttonic by taking the statistical mode of the notes
                                                         # played in a given (movement, channel).
@attribute tmode {Ionian,Mixolydian,Lydian,Aeolian,Dorian,Phrygian,Locrian,Chromatic} # scale being played
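
The order of operations described in the comments above (drop noteoff instances first, then remove command and velocity) can be sketched as follows. The in-memory layout (a list of attribute names plus one list per instance) and the sample rows are assumptions for illustration; the handout code may store the data differently:

```python
# Hypothetical layout: attribute names plus one list per instance.
attrnames = ['movement', 'channel', 'command', 'notenum',
             'velocity', 'tick', 'ttonic', 'tmode']
rows = [
    [0, 0, 'noteon', 55, 100, 0, 7, 'Ionian'],
    [0, 0, 'noteoff', 55, 0, 480, 7, 'Ionian'],
    [0, 0, 'noteon', 59, 96, 480, 7, 'Ionian'],
]

# Step 1: drop noteoff instances while the command column still exists.
cmdix = attrnames.index('command')
noteons = [row for row in rows if row[cmdix] == 'noteon']

# Step 2: only now remove the command and velocity columns.
dropix = {attrnames.index(name) for name in ('command', 'velocity')}
keptnames = [n for i, n in enumerate(attrnames) if i not in dropix]
keptrows = [[v for i, v in enumerate(row) if i not in dropix]
            for row in noteons]

print(keptnames)
print(keptrows)
```

Reversing the two steps would leave no command value to test, which is why the column removal has to wait.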

Figure 1: Distribution of tmode in the training dataset

We will classify tmode from other attributes. The test dataset has a distribution identical to, but sampled independently from, the training data. My genmidi.py Jython script generated training and testing data using different pseudo-random number seeds with distributions seen in that link. You do not need to understand the music theory in genmidi.py to do the assignment.

Figure 2: Distribution of normalized notes in the original data colored by correlation to tmode

In Figure 2 a normalized note is the note's distance from the tagged ttonic, which is the generator's intended "do" note in the tmode. My preparation script extracts an actual "do" note as the statistical mode (most frequently occurring value) of the normalized notes. Here is some diagnostic output from that script.

Computed and tagged tonic by (movement,channel):
('test', 0, 0) -> [7, 7]
('test', 0, 1) -> [7, 7]
('test', 0, 2) -> [7, 7]
('test', 0, 3) -> [7, 7]
('test', 1, 0) -> [9, 9]
('test', 1, 1) -> [9, 9]
('test', 1, 2) -> [9, 9]
('test', 1, 3) -> [9, 9]
('test', 2, 0) -> [7, 7]
('test', 2, 1) -> [7, 7]
('test', 2, 2) -> [7, 7]
('test', 2, 3) -> [10, 7]
('test', 3, 0) -> [7, 7]
('test', 3, 1) -> [7, 7]
('test', 3, 2) -> [7, 7]
('test', 3, 3) -> [7, 7]
('train', 0, 0) -> [7, 7]
('train', 0, 1) -> [7, 7]
('train', 0, 2) -> [7, 7]
('train', 0, 3) -> [7, 7]
('train', 1, 0) -> [9, 9]
('train', 1, 1) -> [9, 9]
('train', 1, 2) -> [9, 9]
('train', 1, 3) -> [9, 9]
('train', 2, 0) -> [7, 7]
('train', 2, 1) -> [7, 7]
('train', 2, 2) -> [7, 7]
('train', 2, 3) -> [1, 7]
('train', 3, 0) -> [7, 7]
('train', 3, 1) -> [7, 7]
('train', 3, 2) -> [7, 7]
('train', 3, 3) -> [7, 7]

All extracted tonics are the same as the tagged tonics except for these two:

('test', 2, 3) -> [10, 7]
('train', 2, 3) -> [1, 7]

This mismatch occurs because channel 3 uses a chromatic scale (mode) with uniform note distribution in movement 2, scattering notes with no pronounced tonic center. Channel 2 also uses a chromatic scale with uniform note distribution in movement 2. Channels 0 and 1 use more constrained modes but with uniform distribution in movement 2. Movements 0, 1, and 3 use Gaussian generation of notes in the tmode, generating more predictive notes for each target mode.
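
A minimal sketch of why a uniform distribution can defeat the statistical mode, assuming the tonic is taken as the most frequent value of notenum%12 within one (movement, channel). The function name empirical_tonic and both note lists are made up for illustration and are not taken from the handout data:

```python
from collections import Counter

def empirical_tonic(notenums):
    """Return the most frequent pitch class (notenum % 12) in a note sequence."""
    counts = Counter(n % 12 for n in notenums)
    return counts.most_common(1)[0][0]

# Notes clustered around a tonic of 7: the mode recovers it.
clustered = [55, 67, 55, 59, 62, 67, 55, 60]
# Every pitch class appears once, except 1, which happens to appear twice.
nearly_uniform = list(range(60, 72)) + [61]

print(empirical_tonic(clustered))       # 7
print(empirical_tonic(nearly_uniform))  # 1
```

When every pitch is about equally likely, one extra occurrence of any note decides the winner, so the extracted tonic need not match the tagged one.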

Handout code also derives these attributes:

@attribute lagNote_0 numeric
@attribute lagNote_1 numeric
@attribute lagNote_2 numeric
@attribute lagNote_3 numeric
@attribute lagNote_4 numeric
@attribute lagNote_5 numeric
@attribute lagNote_6 numeric
@attribute lagNote_7 numeric
@attribute lagNote_8 numeric
@attribute lagNote_9 numeric
@attribute lagNote_10 numeric
@attribute lagNote_11 numeric

These are histogram counts of the intervals within one octave: lagNote_0 counts notes at the extracted ttonic, and lagNote_1 through lagNote_11 count notes at successive steps on the piano above it, up to but not including the next octave. The so-called FULL datasets include all of the above attributes, while the MIN datasets include only these 12 interval counts.
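
A rough sketch of one way to build such a 12-bin interval histogram. This handout section does not state the window of notes over which the handout code accumulates the counts, so this example simply counts over a whole list; the function name interval_histogram and the sample notes are assumptions:

```python
def interval_histogram(notenums, tonic):
    """Count notes at each of the 12 piano steps above tonic, ignoring octave."""
    counts = [0] * 12
    for notenum in notenums:
        # Distance above the tonic, folded into a single octave.
        counts[(notenum - tonic) % 12] += 1
    return counts

# Hypothetical sequence: one octave of a major (Ionian) scale starting on 55,
# plus one repeat of the starting note.
notes = [55, 57, 59, 60, 62, 64, 66, 67, 55]
hist = interval_histogram(notes, tonic=7)
print(hist)   # [3, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
```

Here hist[0] plays the role of lagNote_0 and hist[11] the role of lagNote_11; the pattern of zero and non-zero bins is what distinguishes one tmode from another.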

2. Edit CSC523Fall2022TimeMIDI_generator.py and search for upper-case STUDENT comments. The requirements with % point values are there. There is also a file README.txt in which you must answer questions. The code is worth 30% of the assignment and README.txt the remaining 70%. You can work on the README.txt before completing the code if you want since it uses the Reference files of expected output.

MAKE SURE to search for all upper-case STUDENT comments.


3. We will go over this handout code in class on November 7. When you are ready to test your code, type make clean test in the code directory. A successful test run looks like this:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523Fall2022TimeMIDITrace.txt CSC523Fall2022TimeMIDIOut.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
/bin/rm -f ./MIDIdata/* DEBUG*arff*
/bin/rm -f CSC523Fall2022TimeMIDITrace.txt CSC523Fall2022TimeMIDIOut.txt CSC523Fall2022TimeMIDIOut.sorted.txt.ref
/usr/local/bin/python3.7 CSC523Fall2022TimeMIDI_main.py CSC523Fall2022TimeMIDITrace.txt CSC523Fall2022TimeMIDI_generator fall2022concert_train.arff.gz fall2022concert_test.arff.gz  > CSC523Fall2022TimeMIDIOut.txt
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022TimeMIDIOut.txt CSC523Fall2022TimeMIDIOut.txt.ref > CSC523Fall2022TimeMIDIOut.txt.dif
sed -e 's/^TIME[^D]*DATA/DATA/' < CSC523Fall2022TimeMIDITrace.txt > CSC523Fall2022TimeMIDITrace.tmp
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022TimeMIDITrace.tmp CSC523Fall2022TimeMIDITrace.txt.ref > CSC523Fall2022TimeMIDITrace.txt.dif
# grep DATA CSC523Fall2022TimeMIDIOut.txt.ref | sort -rn -t' ' -k13 --stable | sed -e 's/^DATA/\nDATA/' | grep REGRESSOR > CSC523Fall2022TimeMIDIOut.sorted.txt.ref
# echo "" >> CSC523Fall2022TimeMIDIOut.sorted.txt.ref
grep DATA CSC523Fall2022TimeMIDIOut.txt.ref | sort -rn -t' ' -k17 --stable | sed -e 's/^DATA/\nDATA/' | grep CLASSIFIER >> CSC523Fall2022TimeMIDIOut.sorted.txt.ref

4. Make sure to answer all questions in README.txt. You can do this before your coding is working.

5. After make clean test works without errors (terminates without an error message) and README.txt is answered, type make turnitin and hit Enter at the prompt to get your work to me before the deadline.

If you make subsequent changes and make clean test still passes, you can run make turnitin again and over-write your previous submission. Note that this is not the "turnin" script you may have used in other courses.

There is a 10% per day penalty for late assignments in my courses and I cannot grant any points after I go over a solution.