*******************************************************************
README_558_Assn2.txt CSC558 Fall 2024 Assignment 2.
Each of Q1 through Q15 is worth 6.66% of the assignment.
Please answer all questions, even if you need to guess one. It creates
the opportunity for partial credit. A lack of an answer = 0% for that one.
*******************************************************************
STUDENT NAME:
PREFERRED PRONOUNS:
*******************************************************************
Q1: In the Weka Select attributes tab, Choose CorrelationAttributeEval
and click "Yes" if there is a pop-up. Start this correlation coefficient
calculator and copy&paste all Ranked attributes with a CC (correlation
coefficient) greater than 0.30 as displayed by Weka. Make sure to
include the following header line.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q2: In the Classify tab run tree -> J48 with its default configuration
parameters (a.k.a. hyperparameters) against the full dataset (78
attributes) and paste the following Weka output including the full
decision tree. Do NOT paste Weka output that is not outlined here.
What is the Landis and Koch rating for this kappa value? See
https://faculty.kutztown.edu/parson/fall2019/Fall2019Kappa.html
Inspect the confusion matrix. Instances of what class(es) per row were
classified incorrectly as what (column attribute name)?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q3: Run the Preprocess unsupervised -> attribute filter RemoveUseless
to get rid of the constant-valued attributes ampl1 and freq1, because
constant-valued attributes do not distinguish between distinct values
of a target attribute. Next, remove all attributes except the ones
identified with a CC > 0.30 in Q1, making sure to keep twavetype along
with these non-target attributes.
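Q2 asks for the Landis and Koch rating of J48's kappa value. If you want to sanity-check your reading of the linked table, here is a minimal Python sketch of the usual Landis and Koch agreement bands. The helper name is my own and this is not part of Weka:

```python
# Hypothetical helper (not a Weka feature): maps a kappa statistic to the
# Landis and Koch (1977) descriptive agreement bands referenced in Q2.
def landis_koch_rating(kappa):
    """Return the Landis and Koch descriptive rating for a kappa value."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # kappa cannot legitimately exceed 1.0

print(landis_koch_rating(0.75))  # substantial
```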
Double check your removal to ensure that you have kept all of Q1's
attributes and twavetype. Be careful that the character "l" (ELL) looks
a lot like the digit "1" (ONE). You can UNDO if there are some missing
or extras and try again. Save this dataset as CSC558Assn2Wavetype.arff.gz
using the arff.gz file format. Run J48 as in Q2 and paste the same Weka
output outlined below. What is the Landis and Koch rating for this kappa
value? Did the kappa value decrease by more than 10%? (Related to Q4 re
MDL). Inspect the confusion matrix. Instances of what class(es) per row
were classified incorrectly as what (column attribute name)?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q4: Is the depth of the decision tree of Q3 DEEPER, SHALLOWER, or the
SAME as the decision tree of Q2? Determine depth by counting the highest
number of "|" characters in a single row of the displayed tree and ADD 1
for the attribute that sits at the left column at the top of the tree;
it has no "|". The goal for Minimum Description Length in a decision
tree is to create a simpler tree that is easier to read, LESS depth in
this case, without degrading the accuracy measure by more than 10%
(kappa in this case). Does the decision tree of Q3 improve on that of
Q2 in terms of tree depth and readability? Justify your answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q5: Begin removing attributes listed in your output for Q1, one at a
time, starting with the remaining attribute with the LEAST CC to
twavetype as indicated by Select attributes -> CorrelationAttributeEval.
After each removal run J48 and inspect the decision tree. Stop when the
tree either becomes more SHALLOW (fewer "|" characters on the deepest
line) or DEEPER (more "|" characters in the deepest row of the displayed
decision tree). How deep is the tree?
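Q4 and Q5 both measure tree depth by counting "|" characters in J48's text output. That counting rule can be sketched in Python as follows. The helper is my own, not a Weka feature, and the toy tree below is illustrative only, not output from the handout data:

```python
# Hypothetical sketch (not a Weka feature): computes decision-tree depth
# from J48's text output using the counting rule in Q4 -- the maximum
# number of '|' characters on any line, plus 1 for the root attribute,
# which has no '|' in front of it.
def tree_depth(j48_tree_text):
    return 1 + max((line.count("|") for line in j48_tree_text.splitlines()),
                   default=0)

# Illustrative toy tree, not actual output from the assignment dataset.
sample = """ampl2 <= 0.4
|   freq2 <= 0.2: classA (10.0)
|   freq2 > 0.2
|   |   ampl3 <= 0.5: classB (5.0)
|   |   ampl3 > 0.5: classC (3.0)
ampl2 > 0.4: classD (12.0)"""
print(tree_depth(sample))  # 3
```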
Is it more or less intelligible (readable) than the preceding decision
trees before this last attribute was removed? What attribute was the
final one removed? What non-target attributes remain? Copy and paste
the following Weka output.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q6: Load CSC558Assn2Wavetype.arff.gz back into Weka, run J48, and verify
to yourself that its kappa value matches that of Q3. Next run the
following classifiers and record their kappa values here. Which ones, if
any, improve the kappa value over J48 in Q3? How does the intelligibility
(readability) of these kappa-improving models compare with J48's decision
tree? Examine the RandomForest hyperparameters. How many trees does it
run? (Use More to see definitions for the RandomForest hyperparameters.)
Examine the RandomCommittee hyperparameters. What base classifier does
it run?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q7: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tduplication, leaving 78 attributes. Run filter
unsupervised -> attribute -> NumericToNominal on tduplication's
attribute index only. Check to ensure all other attributes remain
numeric except for tduplication, which now consists of 3 discrete
nominal bins. Save this dataset as CSC558Assn2Dupl.arff.gz. We are
running NumericToNominal because tduplication's values are discrete
values, not continuous floating-point values. In the Weka Select
attributes tab, Choose CorrelationAttributeEval and click "Yes" if there
is a pop-up. Start this correlation coefficient calculator and
copy&paste the top 4 Ranked attributes as displayed by Weka. Make sure
to include the following header line.
STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q8: Remove all attributes except the nominal target attribute
tduplication and the top 4 non-target attributes of Q7. Report the kappa
for the following classifiers with their default hyperparameters. Show
the J48 and RandomTree decision trees in your answer. How does the
reversal attribute appear to correlate with tduplication in the trees,
in other words, as reversal increases, does tduplication tend to
increase or decrease? Note that both reversal and tduplication have
integer values; a decision tree test with a .5 boundary is just
separating adjacent integer values. Inspecting Figures 1 through 18 may
help in thinking about this answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q9: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tgain, which is the amount of amplification
(overall amplitude) of the time-domain signal on a scale of [0.25, 1.0]
as scaled by the signal generator code, leaving 78 attributes. Run
regressor functions -> SimpleLinearRegression and record these results
including the one-term linear formula. What non-target attribute does
SimpleLinearRegression select as the most important for target attribute
tgain? What does the multiplier coefficient in the linear formula APPEAR
TO BE?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q10: In the Preprocess tab, inspect value ranges for ampl1, freq1,
ampl2, freq2, and some of the other attributes. What do you notice about
the ampl1 and freq1 value ranges that is different from the others?
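Min-max scaling, as applied by Weka's unsupervised Normalize filter used below, can be sketched in Python. This is my own illustration, not Weka's implementation; note how a constant-valued attribute such as ampl1 or freq1 has max == min and maps to all zeros:

```python
# Minimal sketch (my own code, not Weka's implementation) of the min-max
# scaling applied per attribute column by the Normalize filter in Q10.
# A constant-valued column (max == min) is mapped to 0.0 rather than
# dividing by zero, which is why ampl1 and freq1 end up in [0.0, 0.0].
def normalize_column(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize_column([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
print(normalize_column([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```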
Now run filter unsupervised -> attribute -> Normalize, which, for each
attribute except the target attribute, applies the formula
((attributeValue - minValue) / (maxValue - minValue)), where minValue
and maxValue are the minimum and maximum values respectively in that
attribute column. This form of normalization computes each non-target
attribute value as the fraction of its distance from minValue to
maxValue. The purpose is to put all non-target attributes on a single
[0.0, 1.0] scale so we can compare their importance. Again run regressor
functions -> SimpleLinearRegression and record these results including
the one-term linear formula. What has changed about the linear formula?
What has changed about the Correlation coefficient, Mean absolute error,
and Root mean squared error metrics compared to Q9? Note that I added
this non-target attribute extraction to the wave data after my first or
second try on this project, to point out the effectiveness of
analysis-based preparation of the data. I will discuss this when I go
over the solution.

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q11: At this point all non-target attribute values are in the range
[0.0, 1.0] except ampl1 and freq1, which are in the constant range
[0.0, 0.0]. What additional information would you need to keep track of
in order to get an individual non-target attribute value back to its
original value?

STUDENT ANSWER:

*******************************************************************
Q12: In the Weka Select attributes tab, Choose CorrelationAttributeEval
and click "Yes" if there is a pop-up. Start this correlation coefficient
calculator and copy&paste the top 4 attributes RANKED BY THEIR ABSOLUTE
VALUES as displayed by Weka. The top 4 will include positive and
negative CCs. Rank them by their magnitude (absolute value). Make sure
to include the following header line.
STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q13: Remove the non-target attribute identified by
SimpleLinearRegression in Q9 and Q10. Run SimpleLinearRegression and
paste the Weka output similar to Q10. What has changed about the simple
linear formula? Why? How does the correlation coefficient of Q10 compare
with Q13's?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q14: Load handout CSC223f24FRQDassn2.arff.gz into Weka and remove all
tagged attributes except tfreq, which is the actual frequency of the
fundamental signal in the wave before being normalized to 1.0 as
attribute freq1, leaving 78 attributes. Run filter unsupervised ->
attribute -> Normalize as before. Run regressors
SimpleLinearRegression, functions -> LinearRegression, and trees -> M5P
and record the following CC and error measures. What are the "Number of
Rules" for M5P?

STUDENT ANSWER, Answer specific questions before pasting Weka output::

*******************************************************************
Q15: Go into M5P's configuration panel and increase minNumInstances from
the default 4 to 2500, forcing more instances into each leaf of the
decision tree, making the tree more shallow. Run M5P and copy and paste
the entire tree including the linear formulas at the leaves, the CC, and
the error measures. What are the "Number of Rules"? Has the M5P model
become more intelligible (readable) without losing more than 10% of its
accuracy, compared to the M5P model of Q14? A < 10% reduction in model
accuracy is my rough threshold for a Minimum Description Length (MDL)
model. Explain your answer.

STUDENT ANSWER, Answer specific questions before pasting Weka output::
*******************************************************************
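The rough MDL acceptance test used in Q4 and Q15, that the simpler model must lose less than 10% of the fuller model's accuracy metric (kappa or CC), can be sketched in Python. The helper name and the numbers in the example are my own illustration, not values from the assignment:

```python
# Hypothetical helper for the rough MDL threshold in Q4 and Q15: a
# simpler model is acceptable if its accuracy metric (kappa or CC)
# drops by less than 10% relative to the fuller model's metric.
def within_mdl_threshold(full_metric, simple_metric, max_loss=0.10):
    return (full_metric - simple_metric) / full_metric < max_loss

# Illustrative values only, not results from this assignment.
print(within_mdl_threshold(0.90, 0.85))  # True  (about a 5.6% loss)
print(within_mdl_threshold(0.90, 0.72))  # False (a 20% loss)
```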