README.txt, CSC523 Fall 2024, Assignment 4
Due by 11:59 PM on Saturday November 23 via D2L Assignment 4.

STUDENT NAME:

Keep the format below the same and just add your answers below the
questions. These results come out of the Reference files of coding-only
Assignment 3. Each answer Q1-Q10 below is worth 10% of the project
grade, 100% total.

READ THE FOLLOWING AND ASK QUESTIONS AS NEEDED.

The following attributes are in the FullLag training and testing
non-target datasets, from CSC523Fall2024TimeMIDISummary.txt.ref:

DATA 14 FullLag CLASSIFIER GaussianNB TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
ATTRIBUTES FOR DATA 14 ['movement', 'channel', 'notenum', 'tick', 'dtonic', 'lagNote_0', 'lagNote_1', 'lagNote_2', 'lagNote_3', 'lagNote_4', 'lagNote_5', 'lagNote_6', 'lagNote_7', 'lagNote_8', 'lagNote_9', 'lagNote_10', 'lagNote_11'] -> tmode

(trained using the CSC523assn3_train_fullag.csv.gz dataset and tested
using CSC523assn3_test_fullag.csv.gz):

$ zcat CSC523assn3_train_fullag.csv.gz | head -1 | sed -e 's/,/\n/g'
movement
channel
notenum
tick
dtonic
lagNote_0
lagNote_1
lagNote_2
lagNote_3
lagNote_4
lagNote_5
lagNote_6
lagNote_7
lagNote_8
lagNote_9
lagNote_10
lagNote_11
tmode
$ zcat CSC523assn3_test_fullag.csv.gz | head -1 | sed -e 's/,/\n/g'
movement
channel
notenum
tick
dtonic
lagNote_0
lagNote_1
lagNote_2
lagNote_3
lagNote_4
lagNote_5
lagNote_6
lagNote_7
lagNote_8
lagNote_9
lagNote_10
lagNote_11
tmode

1. movement is the portion of the concert, in range [0,3].
2. channel is the MIDI musician, also in range [0,3] -- there are 4
   musicians playing in each of the 4 movements.
3. notenum is the distance between the actual note and its dtonic
   "do note", within a single reference octave [0, 11].
4. tick is the time within the movement when the note is played.
5. dtonic is the extracted "do" note (as in do re mi fa so la ti do),
   the most frequently occurring note in the (movement, channel).
6-17.
   each of lagNote_0 through lagNote_11 is the COUNT (histogram bin) of
   lagged notenum values from previous instances within the current
   (movement, channel), over a lag window of the previous 12 notes.
   "lagNote_0" is the count of tonic notes within the lag period, and
   "lagNote_11" is the count of "major 7ths" within the lag period.

You do not need to understand music theory, only the fact that the
distribution of counts in lagNote_0..lagNote_11 correlates to the tmode
(scale), which is the target attribute. Knowing attributes movement,
channel, notenum, tick, and ttonic is like reading the musical score;
dtonic, unlike ttonic in the CSC558 assignment, is extracted from the
notenums for each (movement, channel) pair.

The following attributes are in the LagOnly training and testing
non-target datasets after movement through dtonic are projected away.
These are the notes an attendee would actually hear within a given
movement from a given channel-musician. High and low lagNote_N counters
in this 12-bin histogram act as signatures for the mode classes in
which they do or do not appear. From CSC523Fall2024TimeMIDISummary.txt.ref:

DATA 33 LagOnly CLASSIFIER KNeighborsClassifier_1 TRAIN # 4154 TEST # 4154 kappa 0.933538 Correct 3933 %correct 0.946798
ATTRIBUTES FOR DATA 33 ['lagNote_0', 'lagNote_1', 'lagNote_2', 'lagNote_3', 'lagNote_4', 'lagNote_5', 'lagNote_6', 'lagNote_7', 'lagNote_8', 'lagNote_9', 'lagNote_10', 'lagNote_11'] -> tmode

The following discussions of kappa values sorted in descending order
come after fixing a bug in the makefile to use the correct field for
sorting:

grep DATA CSC523Fall2024TimeMIDISummary.txt.ref | sort -rn -t' ' -k13 --stable | grep CLASSIFIER | sed -e 's/^DATA/\nDATA/' > CSC523Fall2024TimeMIDISummary.sorted.txt.ref

That change in the sort field from "-k17" to "-k13" (the kappa value
field) does not change anything about Assignment 3.
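As a rough sketch of how such a per-note lag histogram can be built
(a hypothetical illustration only, not the genmidi.py or generator-script
code; lag_histogram and its window parameter are invented names):

```python
from collections import deque

def lag_histogram(notenums, window=12):
    """For each note in a (movement, channel) stream, yield a 12-bin
    histogram counting the notenum values [0, 11] of up to `window`
    previous notes -- analogous to lagNote_0 .. lagNote_11."""
    recent = deque(maxlen=window)   # sliding window of prior notenums
    for n in notenums:
        bins = [0] * 12             # lagNote_0 .. lagNote_11
        for prev in recent:
            bins[prev] += 1
        yield bins
        recent.append(n)

# A tonic-heavy stream: counts pile up in lagNote_0 and lagNote_7,
# the kind of signature that correlates with a major-mode tmode.
stream = [0, 7, 4, 0, 7, 0]
hists = list(lag_histogram(stream))
```

The histogram for each note counts only notes that came before it, so
the first note of a stream sees an all-zero histogram.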
There is no test against the sorted file
CSC523Fall2024TimeMIDISummary.sorted.txt.ref, which is generated for
sorted inspection for the README but is not diff'd. Classification in
this assignment uses kappa values as the measure of accuracy.

Additional background from my genmidi.py Jython script that generates
these training and testing datasets:

# Start of genmidi.py excerpts VVVVV
# Major modes (major 3rd)
__IonianMode__     = [0, 2, 4, 5, 7, 9, 11, 12]  # a.k.a. major scale
__LydianMode__     = [0, 2, 4, 6, 7, 9, 11, 12]  # fourth is sharp
__MixolydianMode__ = [0, 2, 4, 5, 7, 9, 10, 12]  # seventh is flat
# Minor modes (minor 3rd)
__AeolianMode__    = [0, 2, 3, 5, 7, 8, 10, 12]  # natural minor scale
__DorianMode__     = [0, 2, 3, 5, 7, 9, 10, 12]  # sixth is sharp
__Phrygian__       = [0, 1, 3, 5, 7, 8, 10, 12]  # 2nd is flat
# Locrian has a minor third; it is also known as a diminished mode
# because of its flat 5th.
__LocrianMode__    = [0, 1, 3, 5, 6, 8, 10, 12]  # 2nd is flat, 5th is flat
# Chromatic is not a mode, it is just all the notes.
__Chromatic__ = [i for i in range(0, 13)]

__movementModes__ = [   # 4 entries per movement, 1 for each of 4 MIDI channels
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
    [__AeolianMode__, __AeolianMode__, __DorianMode__, __Phrygian__],
    # Give the lead instrument Ionia in the dissonant section for added tension.
    [__IonianMode__, __LocrianMode__, __Chromatic__, __Chromatic__],
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
]   # reprise first movement in fourth movement

__RandomGenerators__ = [    # each instrument channel uses the same gen
    # There are 4 for 4 movements.
    __genGaussianClosure__(3), __genGaussianClosure__(3),
    __genUniformClosure__(), __genGaussianClosure__(4)]

__TONIC__ = [7, 9, 7, 7]    # Gmaj, Amin, G?, Gmaj by movement.

def __genGaussianClosure__(sigma):
    ''' Bind sigma for random.gauss.
    The returned function takes a list, placing the Gaussian peak "mu"
    at element 0 and tailing out from there, mapping the tails to
    elements 1..n-1. "mu" = 0 is the mean, sigma is the standard
    deviation. An example modeNoteList is
    __IonianMode__ = [0, 2, 4, 5, 7, 9, 11, 12].
    '''
    def genGaussian(modeNoteList):
        g = int(round(abs(random.gauss(0.0, sigma)))) % len(modeNoteList)
        return(modeNoteList[g])
    return genGaussian

def __genUniformClosure__():
    ''' Bind random.uniform. The returned function takes a list,
    returning a value in a uniform distribution [0,n) for the argument
    list of length n.
    '''
    def genUniform(modeNoteList):
        g = int(round(abs(random.uniform(
            0, len(modeNoteList))))) % len(modeNoteList)
        return(modeNoteList[g])
    return genUniform
...
    rndgen = __RandomGenerators__[movement]  # all channels use the same
...
    interval = rndgen(mymode)
# End of genmidi.py excerpts ^^^^^

----------------------------------------------------------------
Q1: Per CSC523Fall2024TimeMIDISummary.sorted.txt.ref, the highest-kappa
values in their original (--stable) relative order are:

DATA 14 FullLag CLASSIFIER GaussianNB TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
DATA 16 FullLag CLASSIFIER CategoricalNB TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
DATA 18 FullLag CLASSIFIER DecisionTreeClassifier TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
DATA 20 FullLag CLASSIFIER BaggingClassfier TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
DATA 21 FullLag CLASSIFIER KNeighborsClassifier_1 TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000

Note the accuracy of GaussianNB and CategoricalNB, the two Naive Bayes
classifiers in this highest-kappa group. GaussianNB is our basic Naive
Bayes algorithm (conditional probability based) discussed earlier in
the semester and reviewed on a handout.
https://en.wikipedia.org/wiki/Bayes%27_theorem
https://faculty.kutztown.edu/parson/spring2024/BayesCards.txt

Now note the relative order, in terms of kappa, of these 3 Naive Bayes
classifiers for the LagOnly dataset consisting of the 12 lagNote_N
counters and tmode:

DATA 28 LagOnly CLASSIFIER CategoricalNB TRAIN # 4154 TEST # 4154 kappa 0.824618 Correct 3576 %correct 0.860857
DATA 27 LagOnly CLASSIFIER MultinomialNB TRAIN # 4154 TEST # 4154 kappa 0.799447 Correct 3500 %correct 0.842561
DATA 26 LagOnly CLASSIFIER GaussianNB TRAIN # 4154 TEST # 4154 kappa 0.786129 Correct 3445 %correct 0.829321

Given the summary paragraph of these documentation pages:

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

Background: Normal (a.k.a. Gaussian, a.k.a. bell-shaped curve):
"The normal distribution is extremely important because: many
real-world phenomena involve random quantities that are approximately
normal (e.g., errors in scientific measurement); it plays a crucial
role in the Central Limit Theorem, one of the fundamental results in
statistics; its great analytical tractability makes it very popular in
statistical modelling."
https://www.statlect.com/probability-distributions/normal-distribution

QUESTION 1: Why does MultinomialNB outperform GaussianNB on the LagOnly
data, even though it underperforms GaussianNB on the FullLag data, as
seen here?

DATA 15 FullLag CLASSIFIER MultinomialNB TRAIN # 4154 TEST # 4154 kappa 0.815534 Correct 3528 %correct 0.849302

STUDENT ANSWER:

----------------------------------------------------------------
Q2: Following from QUESTION 1, why does CategoricalNB outperform
GaussianNB on the LagOnly data?
Additional background:
"A categorical distribution is a discrete probability distribution that
describes the probability that a random variable will take on a value
that belongs to one of K categories, where each category has a
probability associated with it. For a distribution to be classified as
a categorical distribution, it must meet the following criteria: The
categories are discrete. There are two or more potential categories.
The probability that the random variable takes on a value in each
category must be between 0 and 1. The sum of the probabilities for all
categories must sum to 1."
https://www.statology.org/categorical-distribution/

{IonianMode, LydianMode, MixolydianMode, AeolianMode, DorianMode,
Phrygian, LocrianMode, Chromatic}

STUDENT ANSWER:

----------------------------------------------------------------
Q3: Per CSC523Fall2024TimeMIDISummary.txt.ref, here are the
linear-distance, instance-based classifiers KNeighborsClassifier_n,
where "n" gives the number of k-nearest-neighbor training instances to
the current test instance, on FullLag data including the movement,
channel, notenum, tick, dtonic, and lagNote_? non-target attributes:

DATA 21 FullLag CLASSIFIER KNeighborsClassifier_1 TRAIN # 4154 TEST # 4154 kappa 1.000000 Correct 4154 %correct 1.000000
DATA 22 FullLag CLASSIFIER KNeighborsClassifier_2 TRAIN # 4154 TEST # 4154 kappa 0.801333 Correct 3501 %correct 0.842802
DATA 23 FullLag CLASSIFIER KNeighborsClassifier_3 TRAIN # 4154 TEST # 4154 kappa 0.796141 Correct 3483 %correct 0.838469
DATA 24 FullLag CLASSIFIER KNeighborsClassifier_4 TRAIN # 4154 TEST # 4154 kappa 0.727028 Correct 3266 %correct 0.786230
DATA 25 FullLag CLASSIFIER KNeighborsClassifier_5 TRAIN # 4154 TEST # 4154 kappa 0.696630 Correct 3174 %correct 0.764083

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

WHY WOULD THE KAPPA VALUES DECREASE AS THE NUMBER OF K-NEAREST-NEIGHBORS
INCREASES, FROM 1 TO 5 IN THIS EXAMPLE?
WOULD YOU EXPECT KAPPA ALWAYS TO DECREASE AS K-NEAREST-NEIGHBORS
INCREASES?

STUDENT ANSWER:

----------------------------------------------------------------
Q4: Now note these KNN classifications for the LagOnly data containing
only the lagNote_? counters and target tmode:

DATA 33 LagOnly CLASSIFIER KNeighborsClassifier_1 TRAIN # 4154 TEST # 4154 kappa 0.933538 Correct 3933 %correct 0.946798
DATA 34 LagOnly CLASSIFIER KNeighborsClassifier_2 TRAIN # 4154 TEST # 4154 kappa 0.886866 Correct 3781 %correct 0.910207
DATA 35 LagOnly CLASSIFIER KNeighborsClassifier_3 TRAIN # 4154 TEST # 4154 kappa 0.904158 Correct 3836 %correct 0.923447
DATA 36 LagOnly CLASSIFIER KNeighborsClassifier_4 TRAIN # 4154 TEST # 4154 kappa 0.871395 Correct 3730 %correct 0.897930
DATA 37 LagOnly CLASSIFIER KNeighborsClassifier_5 TRAIN # 4154 TEST # 4154 kappa 0.885959 Correct 3776 %correct 0.909003

QUESTION: WHY DOES AN ODD NUMBER OF NEAREST NEIGHBORS SHOW A HIGHER
KAPPA THAN ITS EVEN-NUMBERED IMMEDIATE PREDECESSOR? E.G.,
KNeighborsClassifier_3'S KAPPA OF 0.904158 IS > KNeighborsClassifier_2'S
0.886866, AND KNeighborsClassifier_5'S KAPPA OF 0.885959 IS >
KNeighborsClassifier_4'S KAPPA OF 0.871395.

Background: Python's statistics module's mode function:

    mode(data)
        Returns the most common data point from discrete or nominal data.
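The majority-vote idea behind KNN classification, and how an even k can
produce a tied vote, can be sketched as follows (a hypothetical
illustration only; majority_vote is an invented name and this is not
scikit-learn's actual tie-breaking logic):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Pick the most common label among the k nearest neighbors'
    labels, and report whether the top count was tied."""
    counts = Counter(neighbor_labels).most_common()
    winner, top = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top
    return winner, tied

# k=3 (odd) yields a clear 2-to-1 majority among nominal tmode labels.
label3, tied3 = majority_vote(['Ionian', 'Ionian', 'Dorian'])
# k=4 (even) can split 2-to-2, forcing an arbitrary tie-break.
label4, tied4 = majority_vote(['Ionian', 'Ionian', 'Dorian', 'Dorian'])
```

With two candidate classes, an odd k can never split the vote evenly,
while an even k can.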
STUDENT ANSWER:

----------------------------------------------------------------
Q5: From CSC523Fall2024TimeMIDITrace.txt.ref, note the dtonic value
extracted from data (not a tagged generator parameter ttonic) as a
function of (movement, channel) pairs for training and testing data:

Extracted dtonic by (movement,channel):
('train', (0, 0), 7) ('train', (0, 1), 7) ('train', (0, 2), 7) ('train', (0, 3), 7)
('train', (1, 0), 9) ('train', (1, 1), 9) ('train', (1, 2), 9) ('train', (1, 3), 9)
('train', (2, 0), 7) ('train', (2, 1), 7) ('train', (2, 2), 7) ('train', (2, 3), 1)
('train', (3, 0), 7) ('train', (3, 1), 7) ('train', (3, 2), 7) ('train', (3, 3), 7)

Extracted dtonic by (movement,channel):
('test', (0, 0), 7) ('test', (0, 1), 7) ('test', (0, 2), 7) ('test', (0, 3), 7)
('test', (1, 0), 9) ('test', (1, 1), 9) ('test', (1, 2), 9) ('test', (1, 3), 9)
('test', (2, 0), 7) ('test', (2, 1), 7) ('test', (2, 2), 7) ('test', (2, 3), 1)
('test', (3, 0), 7) ('test', (3, 1), 7) ('test', (3, 2), 7) ('test', (3, 3), 7)

COMPARE THE EXTRACTED dtonic FOR EACH (MOVEMENT, CHANNEL) TO THE
genmidi.py GENERATOR PARAMETERS:

__TONIC__ = [7, 9, 7, 7]    # Gmaj, Amin, G?, Gmaj by movement.
__movementModes__ = [   # 4 entries per movement, 1 for each of 4 MIDI channels
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
    [__AeolianMode__, __AeolianMode__, __DorianMode__, __Phrygian__],
    # Give the lead instrument Ionia in the dissonant section for added tension.
    [__IonianMode__, __LocrianMode__, __Chromatic__, __Chromatic__],
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
]   # reprise first movement in fourth movement
__RandomGenerators__ = [    # each instrument channel uses the same gen
    # There are 4 for 4 movements.
    __genGaussianClosure__(3), __genGaussianClosure__(3),
    __genUniformClosure__(), __genGaussianClosure__(4)]

ARE THERE ANY MISMATCHES BETWEEN EXTRACTED dtonic VALUES AND THE
genmidi.py __TONIC__ VALUES? IF SO, WHAT ACCOUNTS FOR THE MISMATCHES?
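As background, the "most frequently occurring note" extraction that
defines dtonic can be sketched as follows (a hypothetical illustration
under the assumption that dtonic is simply the modal note per
(movement, channel); extract_dtonic is an invented name, not the
assignment's extraction code):

```python
from collections import Counter

def extract_dtonic(notes_by_group):
    """Map each (movement, channel) key to its most frequent note,
    i.e. the extracted 'do' note dtonic."""
    return {group: Counter(notes).most_common(1)[0][0]
            for group, notes in notes_by_group.items()}

# A generator peaked at the tonic concentrates notes there, so the
# frequency count recovers the generator's tonic parameter.
sample = {
    (0, 0): [7, 7, 2, 7, 9, 7],   # 7 clearly dominates
    (1, 1): [9, 9, 4, 9],         # 9 clearly dominates
}
dtonics = extract_dtonic(sample)
```

When no single note dominates the stream, whichever note happens to
edge out the others wins the frequency count.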
STUDENT ANSWER:

----------------------------------------------------------------
Q6: Per CSC523Fall2024TimeMIDISummary.txt.ref:

DATA 39 Mv_Ch_TONIC_MODE CLUSTERER dbscan5_0.1 INSTANCES # 8308
['CL', 'movement', 'channel', 'dtonic', 'tmode']
[[0, 0, 0, 7, 'Ionian'],
 [1, 0, 1, 7, 'Ionian'],
 [2, 0, 2, 7, 'Mixolydian'],
 [3, 0, 3, 7, 'Lydian'],
 [4, 1, 0, 9, 'Aeolian'],
 [5, 1, 1, 9, 'Aeolian'],
 [6, 1, 2, 9, 'Dorian'],
 [7, 1, 3, 9, 'Phrygian'],
 [8, 2, 0, 7, 'Ionian'],
 [9, 2, 1, 7, 'Locrian'],
 [10, 2, 2, 7, 'Chromatic'],
 [11, 2, 3, 1, 'Chromatic'],
 [12, 3, 0, 7, 'Ionian'],
 [13, 3, 1, 7, 'Ionian'],
 [14, 3, 2, 7, 'Mixolydian'],
 [15, 3, 3, 7, 'Lydian']]

This was generated by this clusterer at STEP 7 in the handout code,
with these attributes from the training + testing set AllData:

    AllData = OutputData[0][1] + OutputData[1][1]
    keepers = ['movement', 'channel', 'dtonic', 'tmode']
    DBSCAN(min_samples=6, eps=0.1), 'dbscan5_0.1')

https://scikit-learn.org/1.5/modules/clustering.html#dbscan
^^^ READ THE FIRST 2 PARAGRAPHS OF THE ABOVE REFERENCE. ^^^
https://scikit-learn.org/1.5/modules/generated/dbscan-function.html

EVEN THOUGH DBSCAN(min_samples=6, eps=0.1) DOES NOT SPECIFY 16 CLUSTERS
(4 MOVEMENTS X 4 CHANNELS), THAT IS HOW DBSCAN IDENTIFIED THE CLUSTERS.
HOW CLOSELY DOES THE ABOVE DATA 39 CLUSTERING MATCH genmidi.py's
__movementModes__ table above? What might account for that precision?

STUDENT ANSWER:

----------------------------------------------------------------
Q7: In Assignment 4's extended CSC523f23TimeSeriesAssn4_generator.py,
I wrote the following code to use only the lagNote_?
attributes and tmode:

    # ADD 1 FOR ASSN4 README:
    losers = ['movement', 'channel', 'notenum', 'tick', 'dtonic']
    keepers = set(OutputData[0][0]) - set(losers)
    clHdrLag, _, clDataLag = project(OutputData[0][0], losers, AllData)
    feedbackList = []
    clusterTableHdr = ['CL'] + clHdrLag # Table comes back in feedbackList
    for clusterer, cname in CLUSTERERS:
        yield(['cluster', 'Mv_Ch_LAGGED_MODE', cname, clusterer,
               clDataLag, clHdrLag, clHdrLag, None, (mode, mode),
               None, None, feedbackList, extract_stack()])
        tracefile.write('LAGGED ONLY CLUSTER feedbackList '
            + 'Mv_Ch_LAGGED_MODE ' + cname + '\n'
            + str(clusterTableHdr) + '\n'
            + pformat(feedbackList[0]) + '\n')

Here is the result from CSC523Fall2024MIDIReadMe4Trace.txt for

    KMeans(n_clusters=n_clusters, algorithm='lloyd', verbose=0,
           init='random', n_init=1000, max_iter=1000,
           random_state=220223523)

LAGGED ONLY CLUSTER feedbackList Mv_Ch_LAGGED_MODE kmeans_4
['CL', 'lagNote_0', 'lagNote_1', 'lagNote_2', 'lagNote_3', 'lagNote_4', 'lagNote_5', 'lagNote_6', 'lagNote_7', 'lagNote_8', 'lagNote_9', 'lagNote_10', 'lagNote_11', 'tmode']
[[0, 5, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 'Ionian'],
 [1, 5, 0, 0, 2, 0, 1, 0, 3, 0, 0, 1, 0, 'Aeolian'],
 [2, 5, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 'Mixolydian'],
 [3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'Chromatic']]

INSPECT THE DISTRIBUTION OF MODES IN __movementModes__ above. Why did
kmeans with n_clusters=4 form clusters with these tmode values of
Ionian, Aeolian, Mixolydian, and Chromatic?

STUDENT ANSWER:

----------------------------------------------------------------
Q8: What additional tmode value would you expect to see if we bumped
the number of clusters (n_clusters) from 4 to 5? Why?
STUDENT ANSWER:

----------------------------------------------------------------
Q9: In this added code for Assignment 4:

    # ADD 3 FOR ASSN4 README:
    c16 = KMeans(n_clusters=16, algorithm='lloyd', verbose=0,
                 init='random', n_init=1000, max_iter=1000,
                 random_state=220223523)
    c16s = ((c16, 'kmeans_16'),)
    clusterTableHdr = ['CL'] + clHdr # Table comes back in feedbackList
    for clusterer, cname in c16s:
        yield(['cluster', 'Mv_Ch_TONIC_MODE', cname, clusterer,
               clData, clHdr, clHdr, None, (mode, mode),
               None, None, feedbackList, extract_stack()])
        feedbackList[0].sort(key=lambda row : (row[1], row[2]))
        tracefile.write('KNN=16 CLUSTER feedbackList '
            + 'Mv_Ch_TONIC_MODE ' + cname + '\n'
            + str(clusterTableHdr) + '\n'
            + pformat(feedbackList[0]) + '\n')

that produces this output in CSC523Fall2024MIDIReadMe4Trace.txt:

KNN=16 CLUSTER feedbackList Mv_Ch_TONIC_MODE kmeans_16
['CL', 'movement', 'channel', 'dtonic', 'tmode']
[[2, 0, 0, 7, 'Ionian'],
 [0, 0, 1, 7, 'Ionian'],
 [14, 0, 2, 7, 'Mixolydian'],
 [9, 0, 3, 7, 'Lydian'],
 [3, 1, 0, 9, 'Aeolian'],
 [1, 1, 1, 9, 'Aeolian'],
 [6, 1, 2, 9, 'Dorian'],
 [10, 1, 3, 9, 'Phrygian'],
 [15, 2, 0, 7, 'Ionian'],
 [7, 2, 1, 7, 'Locrian'],
 [8, 2, 2, 7, 'Chromatic'],
 [5, 2, 3, 1, 'Chromatic'],
 [12, 3, 0, 7, 'Ionian'],
 [4, 3, 1, 7, 'Ionian'],
 [13, 3, 2, 7, 'Mixolydian'],
 [11, 3, 3, 7, 'Lydian']]

WHAT DOES THIS LINE OF CODE ACCOMPLISH? WHY IS IT THERE?

    feedbackList[0].sort(key=lambda row : (row[1], row[2]))

The answer is not just "to sort the table". Sort it how, and why?

STUDENT ANSWER:

----------------------------------------------------------------
Q10: In the code of Q9, what does this argument to yield() accomplish?
Why are we using that argument?

    (mode, mode)

STUDENT ANSWER:

----------------------------------------------------------------
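For reference, Cohen's kappa -- the accuracy measure cited throughout
the summary files above -- can be sketched in a few lines of pure
Python (an illustrative computation only, not the assignment's scoring
code; cohen_kappa is an invented name, and the sketch assumes the two
labelings are not both constant):

```python
from collections import Counter

def cohen_kappa(actual, predicted):
    """Cohen's kappa: observed agreement between two labelings,
    corrected for the agreement expected by chance given each
    labeling's class distribution. 1.0 means perfect agreement."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    ca, cp = Counter(actual), Counter(predicted)
    expected = sum(ca[label] * cp[label] for label in ca) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Three of four tmode predictions correct: observed agreement is 0.75,
# chance agreement is 0.5, so kappa is 0.5 -- lower than raw %correct.
k = cohen_kappa(['Ionian', 'Ionian', 'Dorian', 'Dorian'],
                ['Ionian', 'Ionian', 'Ionian', 'Dorian'])
```

This chance correction is why kappa runs below %correct in every
non-perfect DATA row above, and equals 1.0 only for the perfect
FullLag classifiers.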