*******************************************************************
README_558_Assn2.txt CSC558 Fall 2024 Assignment 2.
Each of Q1 through Q15 is worth 6.66% of the assignment.
Please answer all questions, even if you need to guess one. It creates
the opportunity for partial credit. A lack of an answer = 0% for that one.
*******************************************************************
STUDENT NAME:
PREFERRED PRONOUNS:
*******************************************************************
Q1: In the Weka Select attributes tab, Choose CorrelationAttributeEval
and click "Yes" if there is a pop-up. Start this correlation coefficient
calculator and copy&paste all Ranked attributes with a CC (correlation
coefficient) greater than 0.30 as displayed by Weka. Make sure to
include the following header line.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q2: In the Classify tab run tree -> J48 with its default configuration
parameters (a.k.a. hyperparameters) against the full dataset (78
attributes) and paste the following Weka output including the full
decision tree. Do NOT paste Weka output that is not outlined here.
What is the Landis and Koch rating for this kappa value? See
https://faculty.kutztown.edu/parson/fall2019/Fall2019Kappa.html
Inspect the confusion matrix. Instances of what class(es) per row were
classified incorrectly as what (column attribute name)?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q3: Run the Preprocess unsupervised -> attribute filter RemoveUseless
to get rid of the constant-valued attributes ampl1 and freq1, because
constant-valued attributes do not distinguish between distinct values
of a target attribute. Next, remove all attributes except the ones
identified with a CC > 0.30 in Q1, making sure to keep twavetype along
with these non-target attributes.
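Q2 asks for the Landis and Koch rating of J48's kappa value. If you want to sanity-check your reading of the linked table, here is a minimal Python sketch of the usual Landis and Koch agreement bands. The helper name is my own and this is not part of Weka:

```python
# Hypothetical helper (not a Weka feature): maps a kappa statistic to the
# Landis and Koch (1977) descriptive agreement bands referenced in Q2.
def landis_koch_rating(kappa):
    """Return the Landis and Koch descriptive rating for a kappa value."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # kappa cannot legitimately exceed 1.0

print(landis_koch_rating(0.75))  # substantial
```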
Double check your removal to ensure that you have kept all of Q1's
attributes and twavetype. Be careful that the character "l" (ELL) looks
a lot like the digit "1" (ONE). You can UNDO if there are some missing
or extras and try again. Save this dataset as CSC558Assn2Wavetype.arff.gz
using the arff.gz file format. Run J48 as in Q2 and paste the same Weka
output outlined below. What is the Landis and Koch rating for this kappa
value? Did the kappa value decrease by more than 10%? (Related to Q4 re
MDL). Inspect the confusion matrix. Instances of what class(es) per row
were classified incorrectly as what (column attribute name)?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q4: Is the depth of the decision tree of Q3 DEEPER, SHALLOWER, or the
SAME as the decision tree of Q2? Determine depth by counting the highest
number of "|" characters in a single row of the displayed tree and ADD 1
for the attribute that sits at the left column at the top of the tree;
it has no "|". The goal for Minimum Description Length in a decision
tree is to create a simpler tree that is easier to read, LESS depth in
this case, without degrading the accuracy measure by more than 10%
(kappa in this case). Does the decision tree of Q3 improve on that of
Q2 in terms of tree depth and readability? Justify your answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q5: Begin removing attributes listed in your output for Q1, one at a
time, starting with the remaining attribute with the LEAST CC to
twavetype as indicated by Select attributes -> CorrelationAttributeEval.
After each removal run J48 and inspect the decision tree. Stop when the
tree either becomes more SHALLOW (fewer "|" characters on the deepest
line) or DEEPER (more "|" characters in the deepest row of the displayed
decision tree). How deep is the tree?
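Q4 and Q5 both measure tree depth by counting "|" characters in J48's text output. That counting rule can be sketched in Python as follows. The helper is my own, not a Weka feature, and the toy tree below is illustrative only, not output from the handout data:

```python
# Hypothetical sketch (not a Weka feature): computes decision-tree depth
# from J48's text output using the counting rule in Q4 -- the maximum
# number of '|' characters on any line, plus 1 for the root attribute,
# which has no '|' in front of it.
def tree_depth(j48_tree_text):
    return 1 + max((line.count("|") for line in j48_tree_text.splitlines()),
                   default=0)

# Illustrative toy tree, not actual output from the assignment dataset.
sample = """ampl2 <= 0.4
|   freq2 <= 0.2: classA (10.0)
|   freq2 > 0.2
|   |   ampl3 <= 0.5: classB (5.0)
|   |   ampl3 > 0.5: classC (3.0)
ampl2 > 0.4: classD (12.0)"""
print(tree_depth(sample))  # 3
```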
Is it more or less intelligible (readable) than the preceding decision
trees before this last attribute was removed? What attribute was the
final one removed? What non-target attributes remain? Copy and paste
the following Weka output.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q6: Load CSC558Assn2Wavetype.arff.gz back into Weka, run J48, and verify
to yourself that its kappa value matches that of Q3. Next run the
following classifiers and record their kappa values here. Which ones, if
any, improve the kappa value over J48 in Q3? How does the intelligibility
(readability) of these kappa-improving models compare with J48's decision
tree? Examine the RandomForest hyperparameters. How many trees does it
run? (Use More to see definitions for the RandomForest hyperparameters.)
Examine the RandomCommittee hyperparameters. What base classifier does
it run?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q7: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tduplication, leaving 78 attributes. Run filter
unsupervised -> attribute -> NumericToNominal on tduplication's
attribute index only. Check to ensure all other attributes remain
numeric except for tduplication, which now consists of 3 discrete
nominal bins. Save this dataset as CSC558Assn2Dupl.arff.gz. We are
running NumericToNominal because tduplication's values are discrete
values, not continuous floating-point values. In the Weka Select
attributes tab, Choose CorrelationAttributeEval and click "Yes" if there
is a pop-up. Start this correlation coefficient calculator and
copy&paste the top 4 Ranked attributes as displayed by Weka. Make sure
to include the following header line.
STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q8: Remove all attributes except the nominal target attribute
tduplication and the top 4 non-target attributes of Q7. Report the kappa
for the following classifiers with their default hyperparameters. Show
the J48 and RandomTree decision trees in your answer. How does the
reversal attribute appear to correlate with tduplication in the trees,
in other words, as reversal increases, does tduplication tend to
increase or decrease? Note that both reversal and tduplication have
integer values; a decision tree test with a .5 boundary is just
separating adjacent integer values. Inspecting Figures 1 through 18 may
help in thinking about this answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q9: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tgain, which is the amount of amplification
(overall amplitude) of the time-domain signal on a scale of [0.25, 1.0]
as scaled by the signal generator code, leaving 78 attributes. Run
regressor functions -> SimpleLinearRegression and record these results
including the one-term linear formula. What non-target attribute does
SimpleLinearRegression select as the most important for target attribute
tgain? What does the multiplier coefficient in the linear formula APPEAR
TO BE?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q10: In the Preprocess tab, inspect value ranges for ampl1, freq1,
ampl2, freq2, and some of the other attributes. What do you notice about
the ampl1 and freq1 value ranges that is different from the others?
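Min-max scaling, as applied by Weka's unsupervised Normalize filter used below, can be sketched in Python. This is my own illustration, not Weka's implementation; note how a constant-valued attribute such as ampl1 or freq1 has max == min and maps to all zeros:

```python
# Minimal sketch (my own code, not Weka's implementation) of the min-max
# scaling applied per attribute column by the Normalize filter in Q10.
# A constant-valued column (max == min) is mapped to 0.0 rather than
# dividing by zero, which is why ampl1 and freq1 end up in [0.0, 0.0].
def normalize_column(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize_column([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
print(normalize_column([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```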
Now run filter unsupervised -> attribute -> Normalize, which, for each
attribute except the target attribute, applies the formula
((attributeValue - minValue) / (maxValue - minValue)), where minValue
and maxValue are the minimum and maximum values respectively in that
attribute column. This form of normalization computes each non-target
attribute value as the fraction of its distance from minValue to
maxValue. The purpose is to put all non-target attributes on a single
[0.0, 1.0] scale so we can compare their importance. Again run regressor
functions -> SimpleLinearRegression and record these results including
the one-term linear formula. What has changed about the linear formula?
What has changed about the Correlation coefficient, Mean absolute error,
and Root mean squared error metrics compared to Q9? Note that I added
this non-target attribute extraction to the wave data after my first or
second try on this project, to point out the effectiveness of
analysis-based preparation of the data. I will discuss this when I go
over the solution.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q11: At this point all non-target attribute values are in the range
[0.0, 1.0] except ampl1 and freq1, which are in the constant range
[0.0, 0.0]. What additional information would you need to keep track of
in order to get an individual non-target attribute value back to its
original value?

STUDENT ANSWER:

*******************************************************************
Q12: In the Weka Select attributes tab, Choose CorrelationAttributeEval
and click "Yes" if there is a pop-up. Start this correlation coefficient
calculator and copy&paste the top 4 attributes RANKED BY THEIR ABSOLUTE
VALUES as displayed by Weka. The top 4 will include positive and
negative CCs. Rank them by their magnitude (absolute value). Make sure
to include the following header line.
STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q13: Remove the non-target attribute identified by
SimpleLinearRegression in Q9 and Q10. Run SimpleLinearRegression and
paste the Weka output similar to Q10. What has changed about the simple
linear formula? Why? How does the correlation coefficient of Q10 compare
with Q13's?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q14: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tfreq, which is the actual frequency of the
fundamental signal in the wave before being normalized to 1.0 as
attribute freq1, leaving 78 attributes. Run filter unsupervised ->
attribute -> Normalize as before. Run regressors
SimpleLinearRegression, functions -> LinearRegression, and trees -> M5P
and record the following CC and error measures. What are the "Number of
Rules" for M5P?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q15: Go into M5P's configuration panel and increase minNumInstances from
the default 4 to 2500, forcing more instances into each leaf of the
decision tree, making the tree more shallow. Run M5P and copy and paste
the entire tree including the linear formulas at the leaves, the CC, and
the error measures. What are the "Number of Rules"? Has the M5P model
become more intelligible (readable) without losing more than 10% of its
accuracy, compared to the M5P model of Q14? A < 10% reduction in model
accuracy is my rough threshold for a Minimum Description Length (MDL)
model. Explain your answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::
*******************************************************************
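The rough MDL acceptance test used in Q4 and Q15, that the simpler model must lose less than 10% of the fuller model's accuracy metric (kappa or CC), can be sketched in Python. The helper name and the numbers in the example are my own illustration, not values from the assignment:

```python
# Hypothetical helper for the rough MDL threshold in Q4 and Q15: a
# simpler model is acceptable if its accuracy metric (kappa or CC)
# drops by less than 10% relative to the fuller model's metric.
def within_mdl_threshold(full_metric, simple_metric, max_loss=0.10):
    return (full_metric - simple_metric) / full_metric < max_loss

# Illustrative values only, not results from this assignment.
print(within_mdl_threshold(0.90, 0.85))  # True  (about a 5.6% loss)
print(within_mdl_threshold(0.90, 0.72))  # False (a 20% loss)
```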