README.assn3.txt, CSC558 Fall 2024, Assignment 3
Due by 11:59 PM on Saturday November 2 via D2L Assignment 3.
There is the usual 10% reduction per day late, and I cannot accept submissions after I go over the solution in class.

STUDENT NAME: Add your name here.

Keep the format below the same and just add your answers below the questions. Each answer Q1-Q10 below is worth 10% of the project grade, 100% total. Keep the prompt "YOUR ANSWER:" above your answers so I can search on that.

READ THE FOLLOWING OVERVIEW AND ASK QUESTIONS AS NEEDED. This is an overview of the application, but Q1-Q10 do not really require this application domain knowledge. We will go over lagged data in class.

The following attributes are in the FULL training and testing non-target datasets: movement is the portion of the concert, in range [0,3]; channel is the MIDI musician, also in range [0,3] -- there are 4 musicians playing in each of the 4 movements; notenum is the distance between the actual note and its tonic "do" note; tick is the time within the movement when the note is played; ttonic is the extracted "do" note (as in do re mi fa so la ti do), where the "do" note is the most frequent actual note in a (movement, channel)'s sequence of MIDI notes.

Each of lagNote_0 through lagNote_11 counts lagged notenum intervals from up to 12 previous instances (the lag depth is determined by measuring maximum kappa fed back from the _main.py module) within the current (movement, channel). "lagNote_0" is the count of tonic notes within the lag period, and "lagNote_11" is the count of "major 7ths" within the lag period. You do not need to understand music theory, only the fact that the distribution of counts in lagNote_0..lagNote_11 correlates to the tmode (scale), which is the target attribute. Knowing attributes movement, channel, notenum, tick, and ttonic is like reading the musical score.
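The lagged-count construction described above can be sketched in plain Python. This is a hypothetical simplification, not the actual aggregation script; the function name and the fixed lag depth of 12 are assumptions:

```python
from collections import deque

def lag_histogram(notenums, lag_depth=12):
    """For each note in one (movement, channel) sequence of notenum
    intervals (semitones above the extracted ttonic), emit counts
    lagNote_0..lagNote_11 over the previous lag_depth notes."""
    window = deque(maxlen=lag_depth)   # sliding lag period
    rows = []
    for notenum in notenums:
        counts = [0] * 12
        for lagged in window:          # histogram of the lag period
            counts[lagged % 12] += 1
        rows.append(counts)
        window.append(notenum)
    return rows

# Tonic-heavy toy sequence: do, so, mi, do, fa
rows = lag_histogram([0, 7, 4, 0, 5])
```

For the fourth note, the lag period holds intervals 0, 7, and 4, so lagNote_0, lagNote_7, and lagNote_4 are each 1; tonic-heavy sequences like this inflate lagNote_0, which is why it tops the importance list below.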
From the Python script that aggregates movement-channel note sequences into the lagged attributes:

# From a "backInTime" time lag of 0 up through "backInTime" of -12,
# copy lagged preceding (in time) normalized (w.r.t. extracted tonic)
# note counts into the current instance, stopping before "backInTime" of
# -12 if a peak in kappa is passed, i.e., passing the point of
# diminishing returns on kappa. Derive two "kappa-optimized" datasets
# as follows.
#
# CSC558assn3_train_fulllag.arff.gz, CSC558assn3_test_fulllag.arff.gz
# use all attributes including movement, channel, ticks, etc.,
# through the lagged note histograms and then target attribute tmode.

For all except the Chromatic target tmode, the order of importance of lagged note counts in determining the tmode should be:

lagNote_0  # "do" note, i.e., tonic or root
lagNote_7  # "so" note, which is the musical 5th except in Locrian mode
lagNote_4  # "mi" note, the 3rd in major modes
lagNote_5  # "fa" note (4th) in major & minor modes except Lydian
lagNote_11 # "ti" 7th note in major scales
lagNote_3  # "mi" note, the 3rd in minor modes
lagNote_10 # "ti" 7th note in minor scales
lagNote_9  # "la" 6th note in major scales and Dorian
lagNote_6  # sharp 4th in Lydian mode, flat 5th in Locrian mode
lagNote_8  # "la" 6th note in Aeolian, Phrygian, and Locrian minor modes
lagNote_2  # "re" 2nd interval in all except Dorian and Locrian
lagNote_1  # "re" 2nd interval in Dorian and Locrian modes

The reason for enumerating these lagNote counters is to point out that the intervals near the top of that list are more consistent in predicting the target tmode than the ones near the bottom, which are consistent for some modes but not others. The Chromatic mode, with uniformly-distributed notes in this dataset, shows no such counter pattern.

This is the FULL set of attributes:

@attribute movement numeric  # All musicians play together in a movement.
@attribute channel numeric   # A musician plays a MIDI channel per movement.
@attribute notenum numeric   # This is the interval from the ttonic, where
                             # the ttonic is the most frequently played
                             # MIDI note per (movement, channel) pair.
                             # "Interval" is the distance from ttonic.
@attribute tick numeric      # discrete time step in (movement, channel).
@attribute ttonic numeric    # "do" note derived from frequency of appearance.

# Following are in order of attribute columns.
lagNote_0  # "do" note, i.e., tonic or root
lagNote_1  # "re" 2nd interval in Dorian and Locrian modes
lagNote_2  # "re" 2nd interval in all except Dorian and Locrian
lagNote_3  # "mi" note, the 3rd in minor modes
lagNote_4  # "mi" note, the 3rd in major modes
lagNote_5  # "fa" note (4th) in major & minor modes except Lydian
lagNote_6  # sharp 4th in Lydian mode, flat 5th in Locrian mode
lagNote_7  # "so" note, which is the musical 5th except in Locrian mode
lagNote_8  # "la" 6th note in Aeolian, Phrygian, and Locrian minor modes
lagNote_9  # "la" 6th note in major scales and Dorian
lagNote_10 # "ti" 7th note in minor scales
lagNote_11 # "ti" 7th note in major scales
tmode      # the TAGGED, target attribute we are predicting.
           # "mode" and "scale" are equivalent in this assignment.

The following attributes are in the MIN training and testing non-target datasets after movement..ttonic are projected away. These are the notes an attendee would actually hear within a given movement from a given channel-musician.
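The "stop before the full lag of -12 if a kappa peak is passed" rule quoted from the aggregation script amounts to growing the lag depth while a feedback kappa keeps improving. A minimal sketch of that selection rule, assuming a list of kappa values already measured per candidate lag depth (the helper name and the kappa values are hypothetical):

```python
def lag_depth_at_kappa_peak(kappas):
    """kappas[i] is the kappa measured after extending the lag to
    depth i+1.  Grow the lag while kappa improves and stop at the
    first decline -- the point of diminishing returns."""
    best = 1
    for i in range(1, len(kappas)):
        if kappas[i] <= kappas[i - 1]:   # kappa peak passed
            break
        best = i + 1
    return best

# Hypothetical feedback kappas for candidate lag depths 1..12:
kappas = [0.41, 0.55, 0.63, 0.70, 0.74, 0.77, 0.79, 0.80,
          0.80, 0.79, 0.78, 0.77]
depth = lag_depth_at_kappa_peak(kappas)   # stops at the plateau
```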
This is the MIN set of lagged normalized (with respect to tonic intervals) attributes:

@attribute lagNote_0 numeric
@attribute lagNote_1 numeric
@attribute lagNote_2 numeric
@attribute lagNote_3 numeric
@attribute lagNote_4 numeric
@attribute lagNote_5 numeric
@attribute lagNote_6 numeric
@attribute lagNote_7 numeric
@attribute lagNote_8 numeric
@attribute lagNote_9 numeric
@attribute lagNote_10 numeric
@attribute lagNote_11 numeric
@attribute tmode

----------------------------------------------------------------
Q1: After making sure you are using CSC558assn3_train_fulllag.arff.gz for training and CSC558assn3_test_fulllag.arff.gz for classification testing (per the Classify tab's "Supplied test set" option), run the classifier NaiveBayes under Choose -> classifiers -> bayes with its default config parameters. Record the following values by copying (control-C) from Weka and pasting ONLY the fields from Weka into the templates below.

YOUR ANSWER:

NaiveBayes:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h   <-- classified as
  n  n  n  n  n  n  n  n |  a = attributeName
  n  n  n  n  n  n  n  n |  b = attributeName
  n  n  n  n  n  n  n  n |  c = attributeName
  n  n  n  n  n  n  n  n |  d = attributeName
  n  n  n  n  n  n  n  n |  e = attributeName
  n  n  n  n  n  n  n  n |  f = attributeName
  n  n  n  n  n  n  n  n |  g = attributeName
  n  n  n  n  n  n  n  n |  h = attributeName

----------------------------------------------------------------
Q2: List in the table below the misclassified instances from the confusion matrix, including the counts. Start at the top row of the confusion matrix and work your way down in your answer. If more than 1 misclassification appears in a row, work left to right in that row. If a row has no incorrect classifications, skip it.
YOUR ANSWER:

CORRECT CLASS VALUE    INCORRECTLY CLASSIFIED AS    COUNT
rowName                columnName                   n
...

----------------------------------------------------------------
Q3: In the Preprocess tab remove attributes movement, channel, notenum, tick, and ttonic, leaving attributes lagNote_0 through lagNote_11, which are counts of one octave of MIDI notes normalized as distance from the extracted tonic ("do" note), along with target attribute tmode. You may want to save this as a working file such as lagged.arff so you do not need to keep removing attributes in later steps. Here are the attributes:

Instances:    4154
Attributes:   13
              lagNote_0
              lagNote_1
              lagNote_2
              lagNote_3
              lagNote_4
              lagNote_5
              lagNote_6
              lagNote_7
              lagNote_8
              lagNote_9
              lagNote_10
              lagNote_11
              tmode

Run these two classifiers (NaiveBayes and bayes -> NaiveBayesMultinomial) with this reduced attribute set and record these results as before. NOTE that the first time you run tests, Weka may pop up a question about mapping the training attributes to the test data, since the test data file contains attributes movement, channel, notenum, tick, and ttonic. Just accept the mapping Weka proposes. It ignores the removed training attributes during testing.
Attribute mappings:

Model attributes          Incoming attributes
----------------------    ----------------
(numeric) lagNote_0  -->  6 (numeric) lagNote_0
(numeric) lagNote_1  -->  7 (numeric) lagNote_1
(numeric) lagNote_2  -->  8 (numeric) lagNote_2
(numeric) lagNote_3  -->  9 (numeric) lagNote_3
(numeric) lagNote_4  -->  10 (numeric) lagNote_4
(numeric) lagNote_5  -->  11 (numeric) lagNote_5
(numeric) lagNote_6  -->  12 (numeric) lagNote_6
(numeric) lagNote_7  -->  13 (numeric) lagNote_7
(numeric) lagNote_8  -->  14 (numeric) lagNote_8
(numeric) lagNote_9  -->  15 (numeric) lagNote_9
(numeric) lagNote_10 -->  16 (numeric) lagNote_10
(numeric) lagNote_11 -->  17 (numeric) lagNote_11
(nominal) tmode      -->  18 (nominal) tmode

YOUR ANSWER:

NaiveBayes:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

NaiveBayesMultinomial:
Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

----------------------------------------------------------------
Q4: What accounts for the higher kappa of NaiveBayesMultinomial over NaiveBayes in Q3, given the outline of NaiveBayesMultinomial in its introductory description paragraph here:
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

YOUR ANSWER:

----------------------------------------------------------------
Q5: Restore the full set of 18 attributes:

movement, channel, notenum, tick, ttonic, lagNote_0, lagNote_1, lagNote_2, lagNote_3, lagNote_4, lagNote_5, lagNote_6, lagNote_7, lagNote_8, lagNote_9, lagNote_10, lagNote_11, tmode

by re-loading CSC558assn3_train_fulllag.arff.gz.
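Background for Q4 above: multinomial naive Bayes scores a 12-bin lagNote histogram as draws from a per-class note distribution, which matches how these counts were generated, whereas plain NaiveBayes fits an independent Gaussian to each count. A minimal pure-Python sketch of the multinomial class score; the two class note distributions, the priors, and the histogram are made-up illustrations, not values from these datasets:

```python
from math import log

def multinomial_log_score(counts, class_note_probs, class_prior):
    """Log of P(class) * prod_i p_i^count_i, the multinomial naive
    Bayes score for one lagNote histogram (the count-ordering factor
    is identical across classes, so it is dropped)."""
    return log(class_prior) + sum(c * log(p)
                                  for c, p in zip(counts, class_note_probs))

# Hypothetical smoothed note distributions: a tonic-heavy
# "Ionian-like" class versus a uniform "Chromatic-like" class.
ionian = [0.30, 0.01, 0.08, 0.01, 0.12, 0.10,
          0.01, 0.20, 0.01, 0.07, 0.01, 0.08]
chromatic = [1.0 / 12] * 12

histogram = [5, 0, 1, 0, 2, 1, 0, 3, 0, 0, 0, 0]   # tonic-heavy lag counts
ionian_score = multinomial_log_score(histogram, ionian, 0.5)
chromatic_score = multinomial_log_score(histogram, chromatic, 0.5)
```

A tonic-heavy histogram scores far higher under the tonic-heavy class distribution, which is the kind of evidence the multinomial model exploits directly.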
Make sure CSC558assn3_test_fulllag.arff.gz is set for testing in the Classify tab. Always check that. Run classifier trees -> J48 with the default config parameters. Weka sets the "minNumObj" parameter, which determines how many instances must reach each leaf in the tree, to 2. I sometimes bump it up to 50 or 100 or a higher fraction of the total instances to get a more shallow and intelligible tree, usually at the cost of some kappa accuracy, but for these data leaving minNumObj at the default of 2 is OK.

RECORD ALL of the following Weka output below, including the FULL DECISION TREE and the CONFUSION MATRIX in addition to the error measures:

YOUR ANSWER:

J48 pruned tree
------------------

The FULL DECISION TREE appears here. Paste it in.

Number of Leaves  :     N
Size of the tree  :     N

Time taken to build model: N.n seconds

Correctly Classified Instances        NNNN               N.nnnn %
Incorrectly Classified Instances      NNNN               N.nnnn %
Kappa statistic                       N.nnnn
Mean absolute error                   N.nnnn
Root mean squared error               N.nnnn
Relative absolute error               N.nnnn %
Root relative squared error           N.nnnn %
Total Number of Instances             4154

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h   <-- classified as
  n  n  n  n  n  n  n  n |  a = attributeName
  n  n  n  n  n  n  n  n |  b = attributeName
  n  n  n  n  n  n  n  n |  c = attributeName
  n  n  n  n  n  n  n  n |  d = attributeName
  n  n  n  n  n  n  n  n |  e = attributeName
  n  n  n  n  n  n  n  n |  f = attributeName
  n  n  n  n  n  n  n  n |  g = attributeName
  n  n  n  n  n  n  n  n |  h = attributeName

----------------------------------------------------------------
Q6: In your confusion matrix for Q5, which target tmode attribute value was most often misclassified? How was it actually classified?

YOUR ANSWER:

----------------------------------------------------------------
Q7: In your decision tree for Q5, what are the two primary attributes used to make classification decisions? These are the two non-target attributes closest to the root of the tree, appearing furthest to the left in Weka's tree dump.
Leaf node predictions appear to the right.

YOUR ANSWER:

----------------------------------------------------------------
Q8: Make sure you have all 18 training attributes movement through tmode in the Preprocess tab and CSC558assn3_test_fulllag.arff.gz is set for testing in the Classify tab. Run plain J48 and the following 3 ensemble classifiers with the base classifier set as listed. Do not use the default base classifier ("classifier" parameter) for Bagging or AdaBoostM1. We discussed ensemble learning that runs base classifiers multiple times.
https://faculty.kutztown.edu/parson/fall2022/WekaChapter12.pptx

trees -> J48 (Run J48 by itself, without an ensemble classifier.)
meta -> Bagging with its classifier set to J48.
meta -> AdaBoostM1 with its classifier set to J48.
trees -> RandomForest, which uses RandomTree as its base classifier.

Report only their kappa values below. Do any of the ensemble classifiers show an improved kappa over J48? If so, which one(s)?

YOUR ANSWER:

J48             Kappa statistic  N.nnnn
Bagging(J48)    Kappa statistic  N.nnnn
AdaBoostM1(J48) Kappa statistic  N.nnnn
RandomForest    Kappa statistic  N.nnnn

Do any of the ensemble classifiers show an improved kappa over J48?

----------------------------------------------------------------
The following lines of code from Jython script genmidi.py relate to details in Q9 and Q10:

# Major modes (major 3rd)
__IonianMode__ = [0, 2, 4, 5, 7, 9, 11, 12]      # a.k.a. major scale
__LydianMode__ = [0, 2, 4, 6, 7, 9, 11, 12]      # fourth is sharp
__MixolydianMode__ = [0, 2, 4, 5, 7, 9, 10, 12]  # seventh is flat
# Minor modes (minor 3rd)
__AeolianMode__ = [0, 2, 3, 5, 7, 8, 10, 12]     # natural minor scale
__DorianMode__ = [0, 2, 3, 5, 7, 9, 10, 12]      # sixth is sharp
__Phrygian__ = [0, 1, 3, 5, 7, 8, 10, 12]        # 2nd is flat
# Locrian has a minor third; it is also known as a diminished mode because of its flat 5th.
__LocrianMode__ = [0, 1, 3, 5, 6, 8, 10, 12]     # 2nd is flat, 5th is flat
# Chromatic is not a mode, it is just all the notes.
__Chromatic__ = [i for i in range(0, 13)]
__movementModes__ = [  # 4 entries per movement, 1 for each of 4 MIDI channels
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
    [__AeolianMode__, __AeolianMode__, __DorianMode__, __Phrygian__],
    # Give the lead instrument Ionian in the dissonant section for added tension.
    [__IonianMode__, __LocrianMode__, __Chromatic__, __Chromatic__],
    [__IonianMode__, __IonianMode__, __MixolydianMode__, __LydianMode__],
]  # reprise first movement in fourth movement
__movementModeNames__ = [
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
    ["Aeolian", "Aeolian", "Dorian", "Phrygian"],
    # Give the lead instrument Ionian in the dissonant section for added tension.
    ["Ionian", "Locrian", "Chromatic", "Chromatic"],
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
]  # reprise first movement in fourth movement
__RandomGenerators__ = [  # each instrument channel uses the same gen
    # There are 4 for 4 movements.
    __genGaussianClosure__(3), __genGaussianClosure__(3),
    __genUniformClosure__(), __genGaussianClosure__(4)]
__TONIC__ = [7, 9, 7, 7]  # Gmaj, Amin, G?, Gmaj by movement.
__OCTAVE__ = [  # Range per channel, use uniform for this:
    [4, 5], [3, 4], [3, 4], [3, 5] ]
__SUSTAIN__ = [2, 2, 4, 2]  # per channel

----------------------------------------------------------------
Q9: Load CSC558assn3_train_fulllag.arff.gz and remove attributes notenum and tick, leaving 16 attributes that are not about individual notes and timing but that cover entire (movement, channel) sets of notes. Next, run Preprocess Filter -> unsupervised -> attribute -> NumericToNominal on all attributes. We are running this filter to get discrete cluster values, because all numeric attributes in CSC558assn3_train_fulllag.arff.gz are discrete integers without floating-point fractions. Inspect all attributes in the Preprocess tab to ensure that they are nominal, i.e., set-valued. Finally, in the Cluster tab run SimpleKMeans with the numClusters parameter set to 4.
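On all-nominal data (after NumericToNominal), SimpleKMeans measures distance by attribute mismatches and reports per-attribute modes as centroid values, which is why the Q9 centroid table shows discrete values rather than means. A minimal pure-Python sketch of that idea; the toy rows and the first-k centroid initialization are illustrative assumptions, not Weka's actual implementation:

```python
from collections import Counter

def hamming(row, centroid):
    """Distance between nominal rows = number of mismatched attributes."""
    return sum(a != b for a, b in zip(row, centroid))

def mode_centroid(rows):
    """Per-attribute mode, as k-means centroids for nominal attributes."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def simple_kmeans(rows, k, iterations=20):
    centroids = rows[:k]               # deterministic init for the sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            best = min(range(k), key=lambda c: hamming(row, centroids[c]))
            clusters[best].append(row)
        centroids = [mode_centroid(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Toy nominal rows shaped like (movement, ttonic, tmode):
rows = [(0, 7, "Ionian"), (1, 9, "Aeolian")] * 5
centroids, clusters = simple_kmeans(rows, k=2)
```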
Paste the resulting table below. How do the __TONIC__ values in genmidi.py above relate to the per-movement ttonic values in Clusters 0 through 3? Ignore the "Full Data" column in Q9 and Q10. Which tmode values are correct and which are incorrect, if any, in relation to the (movement, channel) pairs in clusters 0 through 3 as compared to the generator parameters in __movementModeNames__ above?

YOUR ANSWER:

How do the __TONIC__ values in genmidi.py above relate to the per-movement ttonic values in Clusters 0 through 3? Ignore the "Full Data" column in Q9 and Q10.

Which tmode values are correct and which are incorrect, if any, in relation to the (movement, channel) pairs in clusters 0 through 3 as compared to the generator parameters in __movementModeNames__ above?

Final cluster centroids:
                             Cluster#
Attribute     Full Data         0        1        2        3
                (n.n)         (n.n)    (n.n)    (n.n)    (n.n)
====================================================================
movement          n             n        n        n        n
channel           n             n        n        n        n
ttonic            n             n        n        n        n
lagNote_0         n             n        n        n        n
lagNote_1         n             n        n        n        n
lagNote_2         n             n        n        n        n
lagNote_3         n             n        n        n        n
lagNote_4         n             n        n        n        n
lagNote_5         n             n        n        n        n
lagNote_6         n             n        n        n        n
lagNote_7         n             n        n        n        n
lagNote_8         n             n        n        n        n
lagNote_9         n             n        n        n        n
lagNote_10        n             n        n        n        n
lagNote_11        n             n        n        n        n
tmode           value         value    value    value    value

----------------------------------------------------------------
Q10: In the Preprocess tab remove all attributes except for movement, channel, ttonic, and tmode. There are 4 movements X 4 channels = 16 combinations of (movement, channel) with (tmode, ttonic). Set the numClusters parameter of SimpleKMeans to 16 and record these 16 associations. Do not paste the Weka table output of 16 columns; it is too wide. Ignore "Full Data" and paste these associations for the 16 clusters. Are they correct in relation to __movementModeNames__ and __TONIC__? Give details on any incorrect associations.

YOUR ANSWER:

Are they correct in relation to __movementModeNames__ and __TONIC__? Give details on any incorrect associations.
(M, C) where M is movement [0, 3] and C is channel [0, 3].
(0, 0) <-> (TMODEVAL, TTONICVAL)
(0, 1) <-> (TMODEVAL, TTONICVAL)
(0, 2) <-> (TMODEVAL, TTONICVAL)
(0, 3) <-> (TMODEVAL, TTONICVAL)
(1, 0) <-> (TMODEVAL, TTONICVAL)
(1, 1) <-> (TMODEVAL, TTONICVAL)
(1, 2) <-> (TMODEVAL, TTONICVAL)
(1, 3) <-> (TMODEVAL, TTONICVAL)
(2, 0) <-> (TMODEVAL, TTONICVAL)
(2, 1) <-> (TMODEVAL, TTONICVAL)
(2, 2) <-> (TMODEVAL, TTONICVAL)
(2, 3) <-> (TMODEVAL, TTONICVAL)
(3, 0) <-> (TMODEVAL, TTONICVAL)
(3, 1) <-> (TMODEVAL, TTONICVAL)
(3, 2) <-> (TMODEVAL, TTONICVAL)
(3, 3) <-> (TMODEVAL, TTONICVAL)

----------------------------------------------------------------
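As a cross-check for Q10, the ground-truth associations that the 16 clusters should recover can be enumerated directly from the genmidi.py tables quoted above (this derives the expected answers from the generator parameters; it is not Weka output):

```python
# Generator tables copied from the genmidi.py excerpt above.
__movementModeNames__ = [
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
    ["Aeolian", "Aeolian", "Dorian", "Phrygian"],
    ["Ionian", "Locrian", "Chromatic", "Chromatic"],
    ["Ionian", "Ionian", "Mixolydian", "Lydian"],
]
__TONIC__ = [7, 9, 7, 7]   # per movement

# Expected (movement, channel) <-> (tmode, ttonic) pairs for Q10.
expected = {(m, c): (__movementModeNames__[m][c], __TONIC__[m])
            for m in range(4) for c in range(4)}

for (m, c), (tmode, ttonic) in sorted(expected.items()):
    print("(%d, %d) <-> (%s, %d)" % (m, c, tmode, ttonic))
```

Comparing Weka's 16 cluster centroids against this table makes any incorrect association immediately visible.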