CSC 458 - Data Mining & Predictive Analytics I, Spring 2024, Assignment 5 on Bayesian & Instance-based Analysis and Clustering.

ADDED a Python script to analyze monotonic cluster sequences for Q11 in this zip file.
    We will go over it on May 8 in class.


Assignment 5 due 11:59 PM Thursday May 2 via D2L Assignment 5.

We will walk through this spec and allot at least an hour of class work time.

Q1 through Q11 in README.assn5.txt are worth 8% each and a correct CSC458assn5.arff.gz file is worth 12%. There is a 10% penalty for each day it is late to D2L.
 
1. To get the assignment:
Download compressed ARFF data file month_HM_reduced_aggregate.arff.gz and Q&A file README.assn5.txt from these links.
The ARFF raptor data are the same as for Assignment 3. We are going to compare new-to-us models to those of Assignment 3.
You must answer questions in README.assn5.txt and save & later turn in working file CSC458assn5.arff.gz.
Each answer for Q1 through Q11 in README.assn5.txt is worth 8%, and
CSC458assn5.arff.gz with correct contents is worth 12%, totaling 100%. There is a 10% late penalty for each day the assignment is late, and it must be turned in before I go over my solution on May 8 to earn points.

2. Weka and README.assn5.txt operations
Start Weka, bring up the Explorer GUI, and open month_HM_reduced_aggregate.arff.gz.
    Set Files of Type at the bottom of the Open window to (*.arff.gz) to see the input ARFF file. Double click it.
    Alternatively you can gunzip
month_HM_reduced_aggregate.arff.gz to get text file month_HM_reduced_aggregate.arff.

This ARFF file has 45 attributes (columns) and 226 instances (rows) of monthly aggregate data from August through December of 1976 through 2021.
    Here are the attributes in the file. It is a monthly aggregate of daily aggregates of (mostly 1-hour) observation periods.

year                1976-2021
month               8-12
HMtempC_mean        mean for month of temp Celsius during observation times
WindSpd_mean        same for wind speed in km/hour
HMtempC_median      median for month
WindSpd_median
HMtempC_pstdv       population standard deviation
WindSpd_pstdv
HMtempC_min         minimum & maximum
WindSpd_min
HMtempC_max
WindSpd_max
wndN                tally of North winds for all observations in the month, etc.
wndNNE
wndNE
wndENE
wndE
wndESE
wndSE
wndSSE
wndS
wndSSW
wndSW
wndWSW
wndW
wndWNW
wndNW
wndNNW
wndUNK
HMtempC_24_mean     changes in magnitude (absolute value of change) over 24, 48, and 72 hours
HMtempC_48_mean
HMtempC_72_mean
HMtempC_24_median
HMtempC_48_median
HMtempC_72_median
HMtempC_24_pstdv
HMtempC_48_pstdv
HMtempC_72_pstdv
HMtempC_24_min      the min & max are their signed values
HMtempC_48_min
HMtempC_72_min
HMtempC_24_max
HMtempC_48_max
HMtempC_72_max
SS_All              tally of sharp-shinned hawk observations during each month 8-12, 1976-2021; the target attribute

You can examine its contents and sort on attributes by clicking the Edit button in the Preprocess tab.

2a. In the Preprocess tab open Filter -> unsupervised -> attribute -> AddExpression and add the following 3 derived attributes.

    Enter name SS_All_Range10 with an expression aN, where N is the attribute number of SS_All, which is our primary target attribute. (Use a45 for attribute 45, not "45" or "aN".) Apply.

        Be careful NOT TO INCLUDE SPACES within your derived attribute names.

    Enter name SS_All_Log10 with an expression log(aN+1)/log(10), where N is the attribute number of SS_All. Apply.
        This step compresses SS_All.
        The reason for adding +1 to aN is to avoid taking log(0) for SS_All counts of 0, which is undefined. Since all counts are >= 0, no log value will be negative.

    Enter name SS_All_Log10_Range10 with an expression aN, where N is the attribute number of SS_All_Log10, the target attribute just derived. Apply.
 
    At this point you have 48 attributes.
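The AddExpression formula log(aN+1)/log(10) is just a base-10 logarithm with a +1 offset; a minimal Python sketch of the same transform (the sample counts below are made up, not values from the dataset):

```python
import math

def log10_plus1(count):
    """Compress a nonnegative tally the same way as the AddExpression
    log(aN+1)/log(10): the +1 shift maps a zero count to 0 instead of
    the undefined log(0)."""
    return math.log(count + 1) / math.log(10)

# Hypothetical SS_All monthly tallies: 0 stays 0, large counts are compressed.
for c in (0, 9, 99, 9999):
    print(c, round(log10_plus1(c), 3))
```

Note how a count of 9999 compresses to about 4 while 0 stays at 0, pulling the long right tail of raw counts toward the rest of the data.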


2b. Select Filter -> unsupervised -> attribute -> Discretize to chop SS_All_Range10 into 10 discrete classes as follows.
      Then Discretize SS_All_Log10_Range10 using the same steps.

    Set the Discretize attributeIndices to the index of SS_All_Range10, leave the other Discretize parameters at their defaults (useEqualFrequency is False), and Apply.


Set the Discretize attributeIndices to the index of SS_All_Log10_Range10, change config parameter ignoreClass to True (because Weka considers this attribute to be the target attribute at this point), leave the other Discretize parameters at their defaults (useEqualFrequency is False), and Apply.

SS_All_Range10 and SS_All_Log10_Range10 are the only nominal attributes at this point. The remaining 46 are numeric. Please verify this.
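With useEqualFrequency left False, Weka's Discretize filter cuts the attribute's [min, max] range into equal-width intervals; a small sketch of that binning rule (the counts below are invented for illustration):

```python
def equal_width_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-width intervals spanning
    [min, max], mimicking Weka's default (non-equal-frequency) Discretize."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        if width == 0:          # degenerate case: all values identical
            bins.append(0)
            continue
        b = int((v - lo) / width)
        bins.append(min(b, n_bins - 1))  # the maximum falls in the last bin
    return bins

# Hypothetical counts: most are small, one is large, so most land in bin 0.
print(equal_width_bins([0, 5, 10, 50, 100]))  # -> [0, 0, 1, 5, 9]
```

This skew toward the low bins is why discretizing the log10-compressed variant spreads instances across classes more evenly than discretizing raw SS_All.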

Save this dataset as CSC458assn5.arff.gz, making sure "Files of Type" is set to (*.arff.gz) and the file is named correctly. It has 48 attributes. Turn in CSC458assn5.arff.gz when you turn in your README.assn5.txt to D2L. Make sure to use only one variant of SS_All or its derived attributes
during regression, classification, or clustering. Predicting one variant of the target attribute as a function of another target variant is a mistake.

2c. Remove derived attributes SS_All_Range10, SS_All_Log10, and SS_All_Log10_Range10 so that SS_All is the only target attribute, following non-target attribute HMtempC_72_max. We will use the derived attributes later.

REGRESSION:

Q1: In the Classify tab run lazy -> IBk with its default configuration
parameters ("meta-parameters") and fill in the numbers N and N.N
below by sweeping the output with your mouse, hitting control-C to copy,
and then pasting into README.assn5.txt, just like Assignment 3. Make
sure to include "using N nearest neighbour(s)" from Weka's report.

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correlation coefficient                 N.N  
Mean absolute error                   N.N
Root mean squared error               N.N
Relative absolute error                N      %
Root relative squared error            N      %
Total Number of Instances              226

PREP FOR Q2: Repeatedly change the KNN configuration parameter for IBk
(the number of nearest training neighbors over which to average SS_All)
and re-run it until you find the KNN value that gives the maximum
correlation coefficient (CC). I did this two ways. First, I incremented
KNN 1 at a time until I hit a CC peak that was sustained through the
following eight higher KNN values. The first peak may be exceeded later.
I will talk about "hill climbing" algorithms in class. Second, I tried
binary search, where I set KNN to halfway points between the two
previously best KNN values in terms of CC, but this was no less tedious
and harder to keep track of. I recommend the first approach. An incorrect
peak KNN value in terms of CC in your answer will lose points.
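The first search strategy above can be sketched as a simple hill climb. In this sketch, `cc_of_k` is a stand-in for running IBk in Weka with a given KNN and reading off the correlation coefficient; the toy CC curve at the bottom is invented, not the assignment's answer:

```python
def hill_climb_knn(cc_of_k, k_start=1, patience=8, k_max=226):
    """Increase KNN one step at a time; stop once the best CC seen has
    not been beaten for `patience` consecutive larger KNN values."""
    best_k, best_cc = k_start, cc_of_k(k_start)
    since_best = 0
    k = k_start
    while since_best < patience and k < k_max:
        k += 1
        cc = cc_of_k(k)
        if cc > best_cc:
            best_k, best_cc = k, cc
            since_best = 0          # new peak: reset the patience counter
        else:
            since_best += 1
    return best_k, best_cc

# Toy single-peak CC curve with its maximum at k=12 (made up).
toy_cc = lambda k: 1.0 - ((k - 12) / 20.0) ** 2
print(hill_climb_knn(toy_cc))  # -> (12, 1.0)
```

The `patience=8` parameter encodes the "sustained through the following eight higher KNN values" stopping rule; remember that a first peak can still be exceeded later, which is the classic weakness of hill climbing.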

Q2: After finding the peak correlation coefficient (CC) for IBk's
KNN (K nearest neighbors) configuration parameter per instructions
in the handout, paste the following Weka output, again including
"using N nearest neighbour(s)":

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...

Correlation coefficient                 N.N  
Mean absolute error                   N.N
Root mean squared error               N.N
Relative absolute error                N      %
Root relative squared error            N      %
Total Number of Instances              226 

Q3: Load
CSC458assn5.arff.gz into Weka or execute UNDO until SS_All_Log10 is back. Remove the other SS_All-derived attributes
and run IBk with the KNN value for peak CC that you determined in Q2. Paste the following measures. What accounts for the
change in CC? (Refer to the Assignment 3 analyses.)


IB1 instance-based classifier
using N nearest neighbour(s) for classification
...

Correlation coefficient                 N.N  
Mean absolute error                   N.N
Root mean squared error               N.N
Relative absolute error                N      %
Root relative squared error            N      %
Total Number of Instances              226
 
Q4: Take the KNN parameter for peak CC of Q2 and reduce it
1 at a time, not more than 10 consecutive times. Do the same
starting from Q2's KNN but incrementing it 1 at a time, no more
than 10 times. What KNN value now gives the peak CC?
Did KNN (the number of nearest-neighbor training instances) increase
or decrease from Q2's value? Given what you know about logarithmic
compression from Assignment 3, why do you think this KNN value
changed from Q2's?


IB1 instance-based classifier
using N nearest neighbour(s) for classification
...

Correlation coefficient                 N.N  
Mean absolute error                   N.N
Root mean squared error               N.N
Relative absolute error                N      %
Root relative squared error            N      %
Total Number of Instances              226

Q5: Using the KNN value from Q4 for the peak CC, run IBk with that
KNN value against SS_All_Range10 and then SS_All_Log10_Range10
in turn, making sure that no other SS_All-derived attributes are in
the data when classifying. Record these values. How does the
kappa for each of them rate in terms of the Landis & Koch categories?
https://faculty.kutztown.edu/parson/fall2019/Fall2019Kappa.html

SS_All_Range10:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances        N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n %
Root relative squared error             N.n %
Total Number of Instances              226

SS_All_Log10_Range10:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances        N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n %
Root relative squared error             N.n %
Total Number of Instances              226

Q6: From here through Q8 we will classify with SS_All_Log10_Range10
as the target attribute and no other SS_All-derived attributes
in the dataset. Run classifier bayes -> NaiveBayes against this
data and record these results. How does it compare to Q5's
kappa result? What is its Landis & Koch kappa range?

Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances        N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n %
Root relative squared error             N.n %
Total Number of Instances              226
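For Q5 and Q6, recall that the kappa statistic compares observed accuracy with the accuracy a chance classifier would achieve from the marginal class frequencies alone. A minimal sketch of the computation from a confusion matrix (the 2-class matrix below is invented for illustration; Weka's output tables have one row and column per class):

```python
def cohen_kappa(confusion):
    """Kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected comes from the row and column marginals of the
    confusion matrix (a list of rows of counts)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_obs = sum(confusion[i][i] for i in range(n)) / total
    p_exp = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / total ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 2-class matrix: 80 + 70 correct out of 200 instances.
print(round(cohen_kappa([[80, 20], [30, 70]]), 3))  # -> 0.5
```

A kappa of 0.5 falls in the 0.41-0.60 "moderate agreement" band of the Landis & Koch categories linked under Q5.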

PREP FOR REMAINING STEPS:

IBk (KNN) is sensitive to less-important non-target attributes, because they
contribute to the distance metric between instances equally with important
ones. NaiveBayes is sensitive to partially redundant attributes that skew predictions,
as outlined in the March 6 Zoom recording of class. We would like to eliminate both problems.
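The NaiveBayes problem can be seen with a toy calculation: because naive Bayes multiplies per-attribute likelihoods as if they were independent, a duplicated (fully redundant) attribute counts the same evidence twice. All probabilities below are invented for illustration:

```python
def nb_score(prior, likelihoods):
    """Unnormalized naive-Bayes score: the class prior times the product
    of the per-attribute likelihoods (the independence assumption)."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# One attribute whose value favors class A (0.9) over class B (0.2);
# equal priors. Posterior for A:
a, b = nb_score(0.5, [0.9]), nb_score(0.5, [0.2])
print(round(a / (a + b), 3))          # -> 0.818

# The same attribute included twice: identical evidence is double-counted,
# pushing the posterior further than the data justifies.
a2, b2 = nb_score(0.5, [0.9, 0.9]), nb_score(0.5, [0.2, 0.2])
print(round(a2 / (a2 + b2), 3))       # -> 0.953
```

Attribute selection (Q7-Q8) mitigates both problems by dropping weakly correlated attributes for IBk and redundant ones for NaiveBayes.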

Q7: In the Select attributes Weka tab, Choose CorrelationAttributeEval and
accept the Ranker pop-up for Search Method. Run this and list the top 12 in
descending order of CC:

Attribute Evaluator (supervised, Class (nominal): 45 SS_All_Log10_Range10):
    Correlation Ranking Filter
Ranked attributes:
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?
 N.n    N ?

Q8: Remove all attributes except the target SS_All_Log10_Range10
and the top 12 of Q7 in terms of CC to SS_All_Log10_Range10,
leaving 13 attributes. Run IBk with the KNN of Q4 and Q5, NaiveBayes,
and OneR, pasting OneR's decision rule as well as the other measures.
Which of these 3 does the best in terms of kappa? Have IBk and
NaiveBayes improved or degraded in terms of kappa from their Q5
and Q6 counterparts respectively?

IBk:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances       N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n %
Root relative squared error             N.n %
Total Number of Instances              226

NaiveBayes:

Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances       N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n  %
Root relative squared error             N.n %
Total Number of Instances              226

OneR:

=== Classifier model (full training set) ===

RULE GOES HERE

Correctly Classified Instances         N               N.n %
Incorrectly Classified Instances        N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                 N.n %
Root relative squared error             N.n %
Total Number of Instances              226


Q9: Load CSC458assn5.arff.gz into Weka.
Run unsupervised attribute filter NumericToNominal ONLY ON month
in order to make it a discrete value during clustering. Verify
that it has five discrete nominal values {8, 9, 10, 11, 12}.
Temporarily remove all attributes except month and SS_All.
Then go into Cluster -> SimpleKMeans, set the numClusters
configuration parameter to 5, run it, and paste this table.
How does it relate to Figure 6 in Assignment 3's handout? Ignore the
Full Data column.

Final cluster centroids:
                                 Cluster#
Attribute    Full Data         0         1         2         3         4
                 (N.n)     (N.n)     (N.n)     (N.n)     (N.n)     (N.n)
=======================================================================
month                9         8        11        12        10         9
SS_All             N.n       N.n       N.n       N.n       N.n       N.n
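SimpleKMeans is ordinary Lloyd's k-means: assign points to the nearest centroid, recompute centroids, repeat. A compact sketch on invented (month, count) pairs, with deterministic seeding so the result is reproducible (Weka instead uses a random seed and normalizes distances):

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm over tuples of numbers: assign each point
    to its nearest centroid by squared Euclidean distance, then recompute
    each centroid as the mean of its assigned points."""
    centroids = points[:k]  # deterministic seeding: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Toy (month, SS_All-like count) pairs with two obvious groups.
pts = [(8, 10), (8, 12), (9, 300), (9, 320), (10, 310)]
print(sorted(kmeans(pts, 2)))
```

With numClusters=5 on month and SS_All, each centroid summarizes one group of monthly instances, which is what makes the table comparable to Assignment 3's per-month aggregates.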

Q10: Load CSC458assn5.arff.gz and run an AddExpression that
creates a derived attribute wnd_NW_WNW that is the sum of
wndNW and wndWNW. We are doing this because Hawk Mountain
volunteers started using 3-letter wind direction counts only in
1995, causing wndNW counts to decrease and wndWNW counts to
increase from 0. Analysis that we will go over in class revealed
that most of the decrease in the wndNW count went into wndWNW, not
wndNNW. Use the Reorder filter to make SS_All the final attribute
again. Temporarily remove all attributes except wnd_NW_WNW and SS_All,
then go into Cluster -> SimpleKMeans, set the numClusters configuration
parameter to 5, run it, and paste this table. Ignore the Full Data column.
Can you see a general correlation between wnd_NW_WNW and SS_All?
Is there an exception?

Final cluster centroids:
                                 Cluster#
Attribute    Full Data         0         1         2         3         4
                 (N.n)     (N.n)     (N.n)     (N.n)     (N.n)     (N.n)
=========================================================================
wnd_NW_WNW         N.n       N.n       N.n       N.n       N.n       N.n
SS_All             N.n       N.n       N.n       N.n       N.n       N.n

Q11: Load CSC458assn5.arff.gz and delete year, month, and all SS_All-derived
attributes, keeping SS_All itself as the target. Run
Cluster -> SimpleKMeans with numClusters=5 and visually inspect the
table to see if you can find any non-target attribute that increases
or decreases monotonically with SS_All, i.e., not changing direction
as SS_All decreases. Keep the full table in Weka but just paste
the attribute here if you can find one. If you can't find one,
paste the closest attribute you can find. I will award 5 bonus
points if you can find one.
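A monotonic relationship of the kind Q11 asks for can be checked mechanically, which is what the posted Python script automates. A minimal sketch, assuming you have copied each attribute's five cluster-centroid values into lists (all values below are invented):

```python
def monotonic_with(target, attr):
    """True if attr moves in only one direction when the clusters are
    ordered by the target attribute's centroid values, i.e., attr never
    changes direction as the target increases."""
    order = sorted(range(len(target)), key=lambda i: target[i])
    seq = [attr[i] for i in order]
    nondec = all(a <= b for a, b in zip(seq, seq[1:]))
    noninc = all(a >= b for a, b in zip(seq, seq[1:]))
    return nondec or noninc

# Invented centroid values for 5 clusters.
ss_all = [50.0, 400.0, 120.0, 900.0, 10.0]
good   = [1.0, 4.0, 2.0, 9.0, 0.5]    # rises with SS_All -> monotonic
bad    = [3.0, 1.0, 7.0, 2.0, 5.0]    # changes direction -> not monotonic
print(monotonic_with(ss_all, good))   # -> True
print(monotonic_with(ss_all, bad))    # -> False
```

Running this check once per non-target attribute row of the centroid table is much faster and less error-prone than eyeballing 40-plus rows.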

ANSWER BELOW HERE:

Final cluster centroids:
                               Cluster#
Attribute           Full Data         0         1         2         3         4
               (N.n)    (N.n)    (N.n)    (N.n)    (N.n)    (N.n)
===============================================================================
???attribute???        N.n   N.n   N.n   N.n   N.n  N.n
SS_All               N.n   N.n   N.n   N.n   N.n N.n
----------------------------------------------------------------
2g. Reread all questions and make sure you have answered all questions such
    as Landis & Koch categories for kappa and result-to-result comparisons.