README.assn3.txt, CSC558 Fall 2024, Assignment 3
Due by 11:59 PM on Saturday November 2 via D2L Assignment 3.
There is the usual 10% reduction per day late, and I cannot accept submissions after I go over the solution in class.

STUDENT NAME: Add your name here.

Keep the format below the same and just add your answers below the questions. Each answer Q1-Q10 below is worth 10% of the project grade, 100% total. Keep the prompt "YOUR ANSWER:" above your answers so I can search on that.

READ THE FOLLOWING OVERVIEW AND ASK QUESTIONS AS NEEDED. This is an overview of the application, but Q1-Q10 do not really require this application domain knowledge. We will go over lagged data in class.

The following attributes are in the FULL training and testing non-target datasets: movement is the portion of the concert, in range [0,3]; channel is the MIDI musician, also in range [0,3] -- there are 4 musicians playing in each of the 4 movements; notenum is the distance between the actual note and its tonic "do" note; tick is the time within the movement when the note is played; ttonic is the extracted "do" note (as in do re mi fa so la ti do), where the "do" note is the most frequent actual note in a (movement, channel)'s sequence of MIDI notes.

Each of lagNote_0 through lagNote_11 counts lagged notenum intervals from up to 12 previous instances (the lag depth is determined by measuring maximum kappa fed back from the _main.py module) within the current (movement, channel). "lagNote_0" is the count of tonic notes within the lag period, and "lagNote_11" is the count of "major 7ths" within the lag period. You do not need to understand music theory, only the fact that the distribution of counts in lagNote_0..lagNote_11 correlates to the tmode (scale), which is the target attribute. Knowing attributes movement, channel, notenum, tick, and ttonic is like reading the musical score.
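The lagged-count construction described above can be sketched in plain Python. This is a hypothetical simplification, not the actual aggregation script; the function name and the fixed lag depth of 12 are assumptions:

```python
from collections import deque

def lag_histogram(notenums, lag_depth=12):
    """For each note in one (movement, channel) sequence of notenum
    intervals (semitones above the extracted ttonic), emit counts
    lagNote_0..lagNote_11 over the previous lag_depth notes."""
    window = deque(maxlen=lag_depth)   # sliding lag period
    rows = []
    for notenum in notenums:
        counts = [0] * 12
        for lagged in window:          # histogram of the lag period
            counts[lagged % 12] += 1
        rows.append(counts)
        window.append(notenum)
    return rows

# Tonic-heavy toy sequence: do, so, mi, do, fa
rows = lag_histogram([0, 7, 4, 0, 5])
```

For the fourth note, the lag period holds intervals 0, 7, and 4, so lagNote_0, lagNote_7, and lagNote_4 are each 1; tonic-heavy sequences like this inflate lagNote_0, which is why it tops the importance list below.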
From the Python script that aggregates movement-channel note sequences into the lagged attributes:

# From a "backInTime" time lag of 0 up through "backInTime" of -12,
# copy lagged preceding (in time) normalized (w.r.t. extracted tonic)
# note counts into the current instance, stopping before "backInTime" of
# -12 if a peak in kappa is passed, i.e., passing the point of
# diminishing returns on kappa. Derive two "kappa-optimized" datasets
# as follows.
#
# CSC558assn3_train_fulllag.arff.gz, CSC558assn3_test_fulllag.arff.gz
# use all attributes including movement, channel, ticks, etc.,
# through the lagged note histograms and then target attribute tmode.

For all except the Chromatic target tmode, the order of importance of lagged note counts in determining the tmode should be:

lagNote_0  # "do" note, i.e., tonic or root
lagNote_7  # "so" note, which is the musical 5th except in Locrian mode
lagNote_4  # "mi" note, the 3rd in major modes
lagNote_5  # "fa" note (4th) in major & minor modes except Lydian
lagNote_11 # "ti" 7th note in major scales
lagNote_3  # "mi" note, the 3rd in minor modes
lagNote_10 # "ti" 7th note in minor scales
lagNote_9  # "la" 6th note in major scales and Dorian
lagNote_6  # sharp 4th in Lydian mode, flat 5th in Locrian mode
lagNote_8  # "la" 6th note in Aeolian, Phrygian, and Locrian minor modes
lagNote_2  # "re" 2nd interval in all except Dorian and Locrian
lagNote_1  # "re" 2nd interval in Dorian and Locrian modes

The reason for enumerating these lagNote counters is to point out that the intervals near the top of that list are more consistent in predicting the target tmode than the ones near the bottom, which are consistent for some modes but not others. The Chromatic mode, with uniformly-distributed notes in this dataset, shows no such counter pattern.

This is the FULL set of attributes:

@attribute movement numeric  # All musicians play together in a movement.
@attribute channel numeric   # A musician plays a MIDI channel per movement.
@attribute notenum numeric   # This is the interval from the ttonic, where
                             # the ttonic is the most frequently played
                             # MIDI note per (movement, channel) pair.
                             # "Interval" is the distance from ttonic.
@attribute tick numeric      # discrete time step in (movement, channel).
@attribute ttonic numeric    # "do" note derived from frequency of appearance.

# Following are in order of attribute columns.
lagNote_0  # "do" note, i.e., tonic or root
lagNote_1  # "re" 2nd interval in Dorian and Locrian modes
lagNote_2  # "re" 2nd interval in all except Dorian and Locrian
lagNote_3  # "mi" note, the 3rd in minor modes
lagNote_4  # "mi" note, the 3rd in major modes
lagNote_5  # "fa" note (4th) in major & minor modes except Lydian
lagNote_6  # sharp 4th in Lydian mode, flat 5th in Locrian mode
lagNote_7  # "so" note, which is the musical 5th except in Locrian mode
lagNote_8  # "la" 6th note in Aeolian, Phrygian, and Locrian minor modes
lagNote_9  # "la" 6th note in major scales and Dorian
lagNote_10 # "ti" 7th note in minor scales
lagNote_11 # "ti" 7th note in major scales
tmode      # the TAGGED, target attribute we are predicting.
           # "mode" and "scale" are equivalent in this assignment.

The following attributes are in the MIN training and testing non-target datasets after movement..ttonic are projected away. These are the notes an attendee would actually hear within a given movement from a given channel-musician.
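The "stop before the full lag of -12 if a kappa peak is passed" rule quoted from the aggregation script amounts to growing the lag depth while a feedback kappa keeps improving. A minimal sketch of that selection rule, assuming a list of kappa values already measured per candidate lag depth (the helper name and the kappa values are hypothetical):

```python
def lag_depth_at_kappa_peak(kappas):
    """kappas[i] is the kappa measured after extending the lag to
    depth i+1.  Grow the lag while kappa improves and stop at the
    first decline -- the point of diminishing returns."""
    best = 1
    for i in range(1, len(kappas)):
        if kappas[i] <= kappas[i - 1]:   # kappa peak passed
            break
        best = i + 1
    return best

# Hypothetical feedback kappas for candidate lag depths 1..12:
kappas = [0.41, 0.55, 0.63, 0.70, 0.74, 0.77, 0.79, 0.80,
          0.80, 0.79, 0.78, 0.77]
depth = lag_depth_at_kappa_peak(kappas)   # stops at the plateau
```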
This is the MIN set of lagged normalized (with respect to tonic intervals) attributes:

@attribute lagNote_0 numeric
@attribute lagNote_1 numeric
@attribute lagNote_2 numeric
@attribute lagNote_3 numeric
@attribute lagNote_4 numeric
@attribute lagNote_5 numeric
@attribute lagNote_6 numeric
@attribute lagNote_7 numeric
@attribute lagNote_8 numeric
@attribute lagNote_9 numeric
@attribute lagNote_10 numeric
@attribute lagNote_11 numeric
@attribute tmode

----------------------------------------------------------------
Q1: After making sure you are using CSC558assn3_train_fulllag.arff.gz for training and CSC558assn3_test_fulllag.arff.gz for classification testing (per the Classify tab's "Supplied test set" option), run the classifier NaiveBayes under Choose -> classifiers -> bayes with its default config parameters. Record the following values by copying (control-C) from Weka and pasting ONLY the fields from Weka into the templates below.

YOUR ANSWER:

NaiveBayes:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h   <-- classified as
  n  n  n  n  n  n  n  n |  a = attributeName
  n  n  n  n  n  n  n  n |  b = attributeName
  n  n  n  n  n  n  n  n |  c = attributeName
  n  n  n  n  n  n  n  n |  d = attributeName
  n  n  n  n  n  n  n  n |  e = attributeName
  n  n  n  n  n  n  n  n |  f = attributeName
  n  n  n  n  n  n  n  n |  g = attributeName
  n  n  n  n  n  n  n  n |  h = attributeName

----------------------------------------------------------------
Q2: List in the table below the misclassified instances from the confusion matrix, including the counts. Start at the top row of the confusion matrix and work your way down in your answer. If more than 1 misclassification appears in a row, work left to right in that row. If a row has no incorrect classifications, skip it.
YOUR ANSWER:

CORRECT CLASS VALUE    INCORRECTLY CLASSIFIED AS    COUNT
rowName                columnName                   n
...

----------------------------------------------------------------
Q3: In the Preprocess tab remove attributes movement, channel, notenum, tick, and ttonic, leaving attributes lagNote_0 through lagNote_11, which are counts of one octave of MIDI notes normalized as distance from the extracted tonic ("do" note), along with target attribute tmode. You may want to save this as a working file such as lagged.arff so you do not need to keep removing attributes in later steps. Here are the attributes:

Instances:    4154
Attributes:   13
              lagNote_0
              lagNote_1
              lagNote_2
              lagNote_3
              lagNote_4
              lagNote_5
              lagNote_6
              lagNote_7
              lagNote_8
              lagNote_9
              lagNote_10
              lagNote_11
              tmode

Run these two classifiers (NaiveBayes and bayes -> NaiveBayesMultinomial) with this reduced attribute set and record these results as before. NOTE that the first time you run tests, Weka may pop up a question about mapping the training attributes to the test data, since the test data file contains attributes movement, channel, notenum, tick, and ttonic. Just accept the mapping Weka proposes. It ignores the removed training attributes during testing.
Attribute mappings:

Model attributes          Incoming attributes
----------------------    ----------------
(numeric) lagNote_0  -->  6 (numeric) lagNote_0
(numeric) lagNote_1  -->  7 (numeric) lagNote_1
(numeric) lagNote_2  -->  8 (numeric) lagNote_2
(numeric) lagNote_3  -->  9 (numeric) lagNote_3
(numeric) lagNote_4  -->  10 (numeric) lagNote_4
(numeric) lagNote_5  -->  11 (numeric) lagNote_5
(numeric) lagNote_6  -->  12 (numeric) lagNote_6
(numeric) lagNote_7  -->  13 (numeric) lagNote_7
(numeric) lagNote_8  -->  14 (numeric) lagNote_8
(numeric) lagNote_9  -->  15 (numeric) lagNote_9
(numeric) lagNote_10 -->  16 (numeric) lagNote_10
(numeric) lagNote_11 -->  17 (numeric) lagNote_11
(nominal) tmode      -->  18 (nominal) tmode

YOUR ANSWER:

NaiveBayes:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

NaiveBayesMultinomial:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

----------------------------------------------------------------
Q4: What accounts for the higher kappa of NaiveBayesMultinomial over NaiveBayes in Q3, given the outline of NaiveBayesMultinomial in its introductory description paragraph here:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

YOUR ANSWER:

----------------------------------------------------------------
Q5: Restore the full set of 18 attributes:

movement, channel, notenum, tick, ttonic, lagNote_0, lagNote_1, lagNote_2, lagNote_3, lagNote_4, lagNote_5, lagNote_6, lagNote_7, lagNote_8, lagNote_9, lagNote_10, lagNote_11, tmode

by re-loading CSC558assn3_train_fulllag.arff.gz.
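Background for Q4 above: multinomial naive Bayes scores a 12-bin lagNote histogram as draws from a per-class note distribution, which matches how these counts were generated, whereas plain NaiveBayes fits an independent Gaussian to each count. A minimal pure-Python sketch of the multinomial class score; the two class note distributions, the priors, and the histogram are made-up illustrations, not values from these datasets:

```python
from math import log

def multinomial_log_score(counts, class_note_probs, class_prior):
    """Log of P(class) * prod_i p_i^count_i, the multinomial naive
    Bayes score for one lagNote histogram (the count-ordering factor
    is identical across classes, so it is dropped)."""
    return log(class_prior) + sum(c * log(p)
                                  for c, p in zip(counts, class_note_probs))

# Hypothetical smoothed note distributions: a tonic-heavy
# "Ionian-like" class versus a uniform "Chromatic-like" class.
ionian = [0.30, 0.01, 0.08, 0.01, 0.12, 0.10,
          0.01, 0.20, 0.01, 0.07, 0.01, 0.08]
chromatic = [1.0 / 12] * 12

histogram = [5, 0, 1, 0, 2, 1, 0, 3, 0, 0, 0, 0]   # tonic-heavy lag counts
ionian_score = multinomial_log_score(histogram, ionian, 0.5)
chromatic_score = multinomial_log_score(histogram, chromatic, 0.5)
```

A tonic-heavy histogram scores far higher under the tonic-heavy class distribution, which is the kind of evidence the multinomial model exploits directly.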
Make sure CSC558assn3_test_fulllag.arff.gz is set for testing in the Classify tab. Always check that. Run classifier trees -> J48 with the default config parameters. Weka sets the "minNumObj" parameter, which determines how many instances must reach each leaf in the tree, to 2. I sometimes bump it up to 50 or 100 or a higher fraction of the total instances to get a more shallow and intelligible tree, usually at the cost of some kappa accuracy, but for these data leaving minNumObj at the default of 2 is OK.

RECORD ALL of the following Weka output below, including the FULL DECISION TREE and the CONFUSION MATRIX in addition to the error measures:

YOUR ANSWER:

J48 pruned tree
------------------

The FULL DECISION TREE appears here. Paste it in.

Number of Leaves  :     N
Size of the tree  :     N

Time taken to build model: N.n seconds

Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h   <-- classified as
  n  n  n  n  n  n  n  n |  a = attributeName
  n  n  n  n  n  n  n  n |  b = attributeName
  n  n  n  n  n  n  n  n |  c = attributeName
  n  n  n  n  n  n  n  n |  d = attributeName
  n  n  n  n  n  n  n  n |  e = attributeName
  n  n  n  n  n  n  n  n |  f = attributeName
  n  n  n  n  n  n  n  n |  g = attributeName
  n  n  n  n  n  n  n  n |  h = attributeName

----------------------------------------------------------------
Q6: In your confusion matrix for Q5, which target tmode attribute value was most often misclassified? How was it actually classified?

YOUR ANSWER:

----------------------------------------------------------------
Q7: In your decision tree for Q5, what are the two primary attributes used to make classification decisions? These are the two non-target attributes closest to the root of the tree, appearing furthest to the left in Weka's tree dump.
Leaf node predictions appear to the right.

YOUR ANSWER:

----------------------------------------------------------------
Q8: Make sure you have all 18 training attributes movement through tmode in the Preprocess tab and CSC558assn3_test_fulllag.arff.gz is set for testing in the Classify tab. Run plain J48 and the following 3 ensemble classifiers with the base classifier set as listed. Do not use the default base classifier ("classifier" parameter) for Bagging or AdaBoostM1. We discussed ensemble learning that runs base classifiers multiple times.
https://faculty.kutztown.edu/parson/fall2022/WekaChapter12.pptx

trees -> J48 (Run J48 by itself, without an ensemble classifier.)
meta -> Bagging with its classifier set to J48.
meta -> AdaBoostM1 with its classifier set to J48.
trees -> RandomForest, which uses RandomTree as its base classifier.

Report only their kappa values below. Do any of the ensemble classifiers show an improved kappa over J48? If so, which one(s)?

YOUR ANSWER:

J48             Kappa statistic  N.nnnn
Bagging(J48)    Kappa statistic  N.nnnn
AdaBoostM1(J48) Kappa statistic  N.nnnn
RandomForest    Kappa statistic  N.nnnn

Do any of the ensemble classifiers show an improved kappa over J48?

----------------------------------------------------------------
The following lines of code from Jython script genmidi.py relate to details in Q9 and Q10:

# Major modes (major 3rd)
__IonianMode__ = [0, 2, 4, 5, 7, 9, 11, 12]      # a.k.a. major scale
__LydianMode__ = [0, 2, 4, 6, 7, 9, 11, 12]      # fourth is sharp
__MixolydianMode__ = [0, 2, 4, 5, 7, 9, 10, 12]  # seventh is flat
# Minor modes (minor 3rd)
__AeolianMode__ = [0, 2, 3, 5, 7, 8, 10, 12]     # natural minor scale
__DorianMode__ = [0, 2, 3, 5, 7, 9, 10, 12]      # sixth is sharp
__Phrygian__ = [0, 1, 3, 5, 7, 8, 10, 12]        # 2nd is flat
# Locrian has a minor third; it is also known as a diminished mode because of its flat 5th.
__LocrianMode__ = [0, 1, 3, 5, 6, 8, 10, 12]     # 2nd is flat, 5th is flat
# Chromatic is not a mode, it is just all the notes.
__Chromatic__ = [i for i in range(0, 13)]
__movementModes__ = [  # 4 entries per movement, 1 for each of 4 MIDI channels
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
    [__AeolianMode__, __AeolianMode__, __DorianMode__, __Phrygian__],
    # Give the lead instrument Ionian in the dissonant section for added tension.
    [__IonianMode__, __LocrianMode__, __Chromatic__, __Chromatic__],
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
]  # reprise first movement in fourth movement
__movementModeNames__ = [
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
    ["Aeolian", "Aeolian", "Dorian", "Phrygian"],
    # Give the lead instrument Ionian in the dissonant section for added tension.
    ["Ionian", "Locrian", "Chromatic", "Chromatic"],
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
]  # reprise first movement in fourth movement
__RandomGenerators__ = [  # each instrument channel uses the same gen
    # There are 4 for 4 movements.
    __genGaussianClosure__(3), __genGaussianClosure__(3),
    __genUniformClosure__(), __genGaussianClosure__(4)]
__TONIC__ = [7, 9, 7, 7]  # Gmaj, Amin, G?, Gmaj by movement.
__OCTAVE__ = [  # Range per channel, use uniform for this:
    [4, 5], [3, 4], [3, 4], [3, 5] ]
__SUSTAIN__ = [2, 2, 4, 2]  # per channel

----------------------------------------------------------------
Q9: Load CSC558assn3_train_fulllag.arff.gz and remove attributes notenum and tick, leaving 16 attributes that are not about individual notes and timing but that cover entire (movement, channel) sets of notes. Next, run Preprocess Filter -> unsupervised -> attribute -> NumericToNominal on all attributes. We are running this filter to get discrete cluster values, because all numeric attributes in CSC558assn3_train_fulllag.arff.gz are discrete integers without floating-point fractions. Inspect all attributes in the Preprocess tab to ensure that they are nominal, i.e., set-valued. Finally, in the Cluster tab run SimpleKMeans with the numClusters parameter set to 4.
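On all-nominal data (after NumericToNominal), SimpleKMeans measures distance by attribute mismatches and reports per-attribute modes as centroid values, which is why the Q9 centroid table shows discrete values rather than means. A minimal pure-Python sketch of that idea; the toy rows and the first-k centroid initialization are illustrative assumptions, not Weka's actual implementation:

```python
from collections import Counter

def hamming(row, centroid):
    """Distance between nominal rows = number of mismatched attributes."""
    return sum(a != b for a, b in zip(row, centroid))

def mode_centroid(rows):
    """Per-attribute mode, as k-means centroids for nominal attributes."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def simple_kmeans(rows, k, iterations=20):
    centroids = rows[:k]               # deterministic init for the sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            best = min(range(k), key=lambda c: hamming(row, centroids[c]))
            clusters[best].append(row)
        centroids = [mode_centroid(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Toy nominal rows shaped like (movement, ttonic, tmode):
rows = [(0, 7, "Ionian"), (1, 9, "Aeolian")] * 5
centroids, clusters = simple_kmeans(rows, k=2)
```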
Paste the resulting table below. How do the __TONIC__ values in genmidi.py above relate to the per-movement ttonic values in Clusters 0 through 3? Ignore the "Full Data" column in Q9 and Q10. Which tmode values are correct and which are incorrect, if any, in relation to the (movement, channel) pairs in clusters 0 through 3 as compared to the generator parameters in __movementModeNames__ above?

YOUR ANSWER:

How do the __TONIC__ values in genmidi.py above relate to the per-movement ttonic values in Clusters 0 through 3? Ignore the "Full Data" column in Q9 and Q10.

Which tmode values are correct and which are incorrect, if any, in relation to the (movement, channel) pairs in clusters 0 through 3 as compared to the generator parameters in __movementModeNames__ above?

Final cluster centroids:
                             Cluster#
Attribute     Full Data         0        1        2        3
                (n.n)         (n.n)    (n.n)    (n.n)    (n.n)
====================================================================
movement          n             n        n        n        n
channel           n             n        n        n        n
ttonic            n             n        n        n        n
lagNote_0         n             n        n        n        n
lagNote_1         n             n        n        n        n
lagNote_2         n             n        n        n        n
lagNote_3         n             n        n        n        n
lagNote_4         n             n        n        n        n
lagNote_5         n             n        n        n        n
lagNote_6         n             n        n        n        n
lagNote_7         n             n        n        n        n
lagNote_8         n             n        n        n        n
lagNote_9         n             n        n        n        n
lagNote_10        n             n        n        n        n
lagNote_11        n             n        n        n        n
tmode           value         value    value    value    value

----------------------------------------------------------------
Q10: In the Preprocess tab remove all attributes except for movement, channel, ttonic, and tmode. There are 4 movements X 4 channels = 16 combinations of (movement, channel) with (tmode, ttonic). Set the numClusters parameter of SimpleKMeans to 16 and record these 16 associations. Do not paste the Weka table output of 16 columns; it is too wide. Ignore "Full Data" and paste these associations for the 16 clusters. Are they correct in relation to __movementModeNames__ and __TONIC__? Give details on any incorrect associations.

YOUR ANSWER:

Are they correct in relation to __movementModeNames__ and __TONIC__? Give details on any incorrect associations.
(M, C) where M is movement [0, 3] and C is channel [0, 3].
(0, 0) <-> (TMODEVAL, TTONICVAL)
(0, 1) <-> (TMODEVAL, TTONICVAL)
(0, 2) <-> (TMODEVAL, TTONICVAL)
(0, 3) <-> (TMODEVAL, TTONICVAL)
(1, 0) <-> (TMODEVAL, TTONICVAL)
(1, 1) <-> (TMODEVAL, TTONICVAL)
(1, 2) <-> (TMODEVAL, TTONICVAL)
(1, 3) <-> (TMODEVAL, TTONICVAL)
(2, 0) <-> (TMODEVAL, TTONICVAL)
(2, 1) <-> (TMODEVAL, TTONICVAL)
(2, 2) <-> (TMODEVAL, TTONICVAL)
(2, 3) <-> (TMODEVAL, TTONICVAL)
(3, 0) <-> (TMODEVAL, TTONICVAL)
(3, 1) <-> (TMODEVAL, TTONICVAL)
(3, 2) <-> (TMODEVAL, TTONICVAL)
(3, 3) <-> (TMODEVAL, TTONICVAL)

----------------------------------------------------------------
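As a cross-check for Q10, the ground-truth associations that the 16 clusters should recover can be enumerated directly from the genmidi.py tables quoted above (this derives the expected answers from the generator parameters; it is not Weka output):

```python
# Generator tables copied from the genmidi.py excerpt above.
__movementModeNames__ = [
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
    ["Aeolian", "Aeolian", "Dorian", "Phrygian"],
    ["Ionian", "Locrian", "Chromatic", "Chromatic"],
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
]
__TONIC__ = [7, 9, 7, 7]   # per movement

# Expected (movement, channel) <-> (tmode, ttonic) pairs for Q10.
expected = {(m, c): (__movementModeNames__[m][c], __TONIC__[m])
            for m in range(4) for c in range(4)}

for (m, c), (tmode, ttonic) in sorted(expected.items()):
    print("(%d, %d) <-> (%s, %d)" % (m, c, tmode, ttonic))
```

Comparing Weka's 16 cluster centroids against this table makes any incorrect association immediately visible.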