CSC 458 - Data Mining & Predictive Analytics I, Spring 2024
Assignment 5 on Bayesian & Instance-based Analysis and Clustering

ADDED a Python script to analyze monotonic cluster sequences for Q11 in
this zip file. We will go over it on May 8 in class.

Assignment 5 is due 11:59 PM Thursday May 2 via D2L.

We will walk through this spec and allot at least an hour of class work
time. Q1 through Q11 in README.assn5.txt are worth 8% each and a correct
CSC458assn5.arff.gz file is worth 12%. There is a 10% penalty for each
day it is late to D2L.
1. To get the assignment:

Download compressed ARFF data file month_HM_reduced_aggregate.arff.gz
and Q&A file README.assn5.txt from these links. The ARFF raptor data
are the same as for Assignment 3; we are going to compare new-to-us
models to those of Assignment 3.

You must answer the questions in README.assn5.txt and save & later turn
in working file CSC458assn5.arff.gz. Each answer for Q1 through Q11 in
README.assn5.txt is worth 8 points, and CSC458assn5.arff.gz with correct
contents is worth 12 points, totaling 100%. There is a 10% late penalty
for each day the assignment is late, and it must be turned in before I
go over my solution on May 8 to earn points.
2. Weka and README.assn5.txt operations

Start Weka, bring up the Explorer GUI, and open
month_HM_reduced_aggregate.arff.gz. Set Files of Type at the bottom of
the Open window to (*.arff.gz) to see the input ARFF file, then double
click it. Alternatively, you can gunzip month_HM_reduced_aggregate.arff.gz
to get text file month_HM_reduced_aggregate.arff.

This ARFF file has 45 attributes (columns) and 226 instances (rows) of
monthly aggregate data for August through December of 1976 through 2021.
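If you want to sanity-check the gzipped ARFF file outside Weka, a small
Python sketch can count the attribute declarations and data rows. (This
is a convenience sketch, not part of the assignment; `arff_counts` is a
hypothetical helper name, and the path is whatever local copy you have.)

```python
# Sketch: count @attribute declarations and data rows in a gzipped ARFF file.
import gzip

def arff_counts(path):
    """Return (number of attributes, number of data instances)."""
    n_attrs, n_rows, in_data = 0, 0, False
    with gzip.open(path, "rt") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue                    # skip blanks and ARFF comments
            if in_data:
                n_rows += 1                 # every non-blank line after @data
            elif line.lower().startswith("@attribute"):
                n_attrs += 1
            elif line.lower().startswith("@data"):
                in_data = True
    return n_attrs, n_rows
```

For month_HM_reduced_aggregate.arff.gz this should report 45 attributes
and 226 instances if your download is intact.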
Here are the attributes in the file. It is a monthly aggregate of daily
aggregates of (mostly 1-hour) observation periods.

year                1976-2021
month               8-12
HMtempC_mean        mean for month of temp Celsius during observation times
WindSpd_mean        same for wind speed in km/hour
HMtempC_median      median for month
WindSpd_median
HMtempC_pstdv       population standard deviation
WindSpd_pstdv
HMtempC_min         minimum & maximum
WindSpd_min
HMtempC_max
WindSpd_max
wndN                tally of North winds for all observations in the month, etc.
wndNNE
wndNE
wndENE
wndE
wndESE
wndSE
wndSSE
wndS
wndSSW
wndSW
wndWSW
wndW
wndWNW
wndNW
wndNNW
wndUNK
HMtempC_24_mean     changes in magnitude (absolute value of change) over
                    24, 48, and 72 hours
HMtempC_48_mean
HMtempC_72_mean
HMtempC_24_median
HMtempC_48_median
HMtempC_72_median
HMtempC_24_pstdv
HMtempC_48_pstdv
HMtempC_72_pstdv
HMtempC_24_min      the min & max are their signed values
HMtempC_48_min
HMtempC_72_min
HMtempC_24_max
HMtempC_48_max
HMtempC_72_max
SS_All              tally of sharp-shinned hawk observations during each
                    month 8-12, 1976-2021. Target attribute.
You can examine its contents and sort on attributes by clicking the
Edit button in the Preprocess tab.
2a. In the Preprocess tab open Filter -> unsupervised -> attribute ->
AddExpression and add the following 3 derived attributes. Be careful
NOT TO INCLUDE SPACES within your derived attribute names.

Enter name SS_All_Range10 with expression aN, where N is the attribute
number of SS_All, which is our primary target attribute. (Use a45 for
attribute 45, not "45" or "aN".) Apply.

Enter name SS_All_Log10 with expression log(aN+1)/log(10), where N is
the attribute number of SS_All. Apply. This step logarithmically
compresses SS_All. The reason for adding +1 to aN is to avoid taking
log(0), which is undefined, for SS_All counts of 0. None of the
results will be negative.

Enter name SS_All_Log10_Range10 with expression aN, where N is the
attribute number of SS_All_Log10, the target attribute just derived.
Apply.

At this point you have 48 attributes.
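The log(aN+1)/log(10) expression can be reproduced in plain Python to
see why a count of 0 is safe and why no result is negative. (A sketch;
`log10_plus1` is a hypothetical name, not a Weka identifier.)

```python
# Sketch of Weka's AddExpression log(aN+1)/log(10) for a count x >= 0.
import math

def log10_plus1(x):
    """Base-10 log compression with the +1 shift: 0 maps to 0, never negative."""
    return math.log(x + 1) / math.log(10)

# A count of 0 maps to 0.0 instead of the undefined log(0);
# 9 maps to 1.0, 99 maps to 2.0, etc.
```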
2b. Select Filter -> unsupervised -> attribute -> Discretize to chop
SS_All_Range10 into 10 discrete classes, then Discretize
SS_All_Log10_Range10 using the same steps, as follows.

Set the Discretize attributeIndices to the index of SS_All_Range10,
leave the other Discretize parameters at their defaults
(useEqualFrequency is False), and Apply.

Set the Discretize attributeIndices to the index of
SS_All_Log10_Range10, change config parameter ignoreClass to True
(because Weka considers this attribute to be the target attribute at
this point), leave the other Discretize parameters at their defaults
(useEqualFrequency is False), and Apply.

SS_All_Range10 and SS_All_Log10_Range10 are the only nominal
attributes at this point. The remaining 46 are numeric. Please verify
this.
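With useEqualFrequency left False, Discretize uses equal-width binning:
the attribute's observed range is split into 10 intervals of equal
width. A minimal sketch of the idea (assuming known min/max;
`equal_width_bin` is a hypothetical helper, not Weka's code):

```python
# Sketch of equal-width discretization (Weka Discretize default, 10 bins).
def equal_width_bin(x, lo, hi, bins=10):
    """Return the 0-based bin index of x over [lo, hi] split into equal widths."""
    if x >= hi:
        return bins - 1              # the top edge falls in the last bin
    width = (hi - lo) / bins
    return int((x - lo) // width)
```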
Save this dataset as CSC458assn5.arff.gz, making sure "Files of Type"
is set to (*.arff.gz) and the file is named correctly. It has 48
attributes. Turn in CSC458assn5.arff.gz when you turn in your
README.assn5.txt to D2L. Make sure to use only one variant of SS_All
or its derived attributes during regression, classification, or
clustering. Predicting one variant of the target attribute as a
function of another target variant is a mistake.

2c. Remove derived attributes SS_All_Range10, SS_All_Log10, and
SS_All_Log10_Range10 so that SS_All is the only target attribute,
following non-target attribute HMtempC_72_max. We will use them later.
REGRESSION:

Q1: In the Classify tab run lazy -> IBk with its default configuration
parameters ("meta-parameters") and fill in the numbers N and N.N below
by sweeping the output with your mouse, hitting control-C to copy, and
then pasting into README.assn5.txt, just like Assignment 3. Make sure
to include "using N nearest neighbour(s)" from Weka's report.

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correlation coefficient                  N.N
Mean absolute error                      N.N
Root mean squared error                  N.N
Relative absolute error                  N %
Root relative squared error              N %
Total Number of Instances                226
PREP FOR Q2: Repeatedly change the KNN configuration parameter for IBk
and run it again until you find the KNN (number of nearest training
neighbors over which to average SS_All) value that gives the maximum
correlation coefficient (CC). I did this two ways. First, I
incremented KNN 1 at a time until I hit a CC peak that was sustained
through the following eight higher KNN values; the first peak may be
exceeded later. I will talk about "hill climbing" algorithms in class.
Second, I tried binary search, where I set KNN to halfway points
between the previous two highest KNN values in terms of CC, but this
was no less tedious and harder to keep track of. I recommend the first
approach. An incorrect peak KNN value in terms of CC in your answer
will lose points.
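The first (hill-climbing) approach can be sketched in Python, assuming
a hypothetical callable `run_ibk(k)` that runs IBk with KNN=k and
returns Weka's reported CC. This is only an illustration of the search
loop; in the assignment you do these runs by hand in the Explorer.

```python
# Hill-climbing sketch: stop once the best CC has survived `patience`
# consecutive larger K values (the handout uses patience = 8).
def find_peak_k(run_ibk, k_start=1, patience=8):
    best_k, best_cc = k_start, run_ibk(k_start)
    k, since_best = k_start, 0
    while since_best < patience:
        k += 1
        cc = run_ibk(k)
        if cc > best_cc:
            best_k, best_cc, since_best = k, cc, 0   # new peak; reset the count
        else:
            since_best += 1
    return best_k, best_cc
```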
Q2: After finding the peak correlation coefficient (CC) for IBk's KNN
(K nearest neighbors) configuration parameter per instructions in the
handout, paste the following Weka output, again including "using N
nearest neighbour(s)":

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correlation coefficient                  N.N
Mean absolute error                      N.N
Root mean squared error                  N.N
Relative absolute error                  N %
Root relative squared error              N %
Total Number of Instances                226
Q3: Load CSC458assn5.arff.gz into Weka or execute UNDO until
SS_All_Log10 is back. Remove the other SS_All-derived attributes and
run IBk with the KNN value for peak CC that you determined in Q2.
Paste the following measures. What accounts for the change in CC?
(Refer to the Assignment 3 analyses.)

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correlation coefficient                  N.N
Mean absolute error                      N.N
Root mean squared error                  N.N
Relative absolute error                  N %
Root relative squared error              N %
Total Number of Instances                226
Q4: Take the KNN parameter for peak CC of Q2 and reduce it 1 at a
time, not more than 10 consecutive times. Do the same for Q2's KNN but
incrementing it 1 at a time, no more than 10 times. What is the KNN
now in terms of peak CC value? Did KNN (number of nearest neighbor
training instances) increase or decrease from Q2's value? Given what
you know about logarithmic compression from Assignment 3, why do you
think this KNN value made this change from Q2?

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correlation coefficient                  N.N
Mean absolute error                      N.N
Root mean squared error                  N.N
Relative absolute error                  N %
Root relative squared error              N %
Total Number of Instances                226
Q5: Using the KNN value from Q4 for the peak CC, run IBk with that KNN
value against SS_All_Range10 and then SS_All_Log10_Range10 in turn,
making sure that no other SS_All-derived values are in the data when
classifying. Record these values. How does the kappa in each of them
rate in terms of the Landis & Koch categories?
https://faculty.kutztown.edu/parson/fall2019/Fall2019Kappa.html
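The Landis & Koch (1977) agreement bands from the page above can be
captured in a small helper, handy for answering the kappa questions in
Q5 through Q8. (A convenience sketch; `landis_koch` is a hypothetical
name, not part of the assignment or Weka.)

```python
# Landis & Koch (1977) interpretation bands for the kappa statistic.
def landis_koch(kappa):
    """Map a kappa value to its Landis & Koch agreement category."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```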
SS_All_Range10:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226

SS_All_Log10_Range10:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226
Q6: From here through Q8 we will classify with SS_All_Log10_Range10 as
the target attribute and no other SS_All-derived attributes in the
dataset. Run classifier bayes -> NaiveBayes against this data and
record these results. How does it compare to Q5's kappa result? What
is its Landis & Koch kappa range?

Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226
PREP FOR REMAINING STEPS: IBk (KNN) is sensitive to less-important
non-target attributes, which contribute to the
distance-between-training-instances metric equally with important
ones. NaiveBayes is sensitive to partially redundant attributes
skewing predictions, as outlined in the March 6 Zoom recording of
class. We would like to eliminate both problems.
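The IBk half of this can be seen in a tiny example: unweighted
Euclidean distance lets a noisy, irrelevant attribute swamp an
informative one, so a worse match can look nearer. (An illustrative
sketch only; the instance values are made up.)

```python
# Sketch: unweighted Euclidean distance treats every attribute equally.
import math

def euclidean(a, b):
    """Plain Euclidean distance over attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Instances are (important_attr, irrelevant_attr).
query      = (1.0, 0.0)
good_match = (1.0, 100.0)   # identical on the important attribute
bad_match  = (5.0, 0.0)     # far away on the important attribute
# The irrelevant attribute dominates the metric, so bad_match looks closer.
```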
Q7: In the Select attributes Weka tab, choose CorrelationAttributeEval
and accept the Ranker pop-up for Search Method. Run this and list the
top 12 in descending order of CC:

Attribute Evaluator (supervised, Class (nominal): 45 SS_All_Log10_Range10):
        Correlation Ranking Filter
Ranked attributes:
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
 N.n   N ?
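CorrelationAttributeEval plus Ranker amounts to sorting attributes by
the strength of their correlation with the class. A rough Python
sketch of that idea using Pearson's r (hypothetical helpers; Weka's
handling of a nominal class differs in detail, so treat this as the
concept, not Weka's exact computation):

```python
# Sketch: rank attributes by |Pearson correlation| with a target column.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, target):
    """columns: {name: [values]}; return names sorted by |r| descending."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```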
Q8: Remove all attributes except the target SS_All_Log10_Range10 and
the top 12 of Q7 in terms of CC to SS_All_Log10_Range10, leaving 13
attributes. Run IBk with the KNN of Q4 and Q5, NaiveBayes, and OneR,
pasting OneR's decision rule as well as the other measures. Which of
these 3 does the best in terms of kappa? Have IBk and NaiveBayes
improved or degraded in terms of kappa from their Q5 and Q6
counterparts respectively?

IBk:

IB1 instance-based classifier
using N nearest neighbour(s) for classification
...
Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226

NaiveBayes:

Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226

OneR:

=== Classifier model (full training set) ===

RULE GOES HERE

Correctly Classified Instances           N               N.n %
Incorrectly Classified Instances         N               N.n %
Kappa statistic                          N.n
Mean absolute error                      N.n
Root mean squared error                  N.n
Relative absolute error                  N.n %
Root relative squared error              N.n %
Total Number of Instances                226
Q9: Load CSC458assn5.arff.gz into Weka. Run unsupervised attribute
filter NumericToNominal ONLY ON month in order to make it a discrete
value during clustering. Verify that it has five discrete nominal
values {8, 9, 10, 11, 12}. Temporarily remove all attributes except
month and SS_All. After applying this filter, go into Cluster ->
SimpleKMeans, set the numClusters configuration parameter to 5, run
it, and paste this table. How does it relate to Figure 6 in Assignment
3's handout? Ignore the Full Data column.

Final cluster centroids:
                        Cluster#
Attribute   Full Data       0        1        2        3        4
              (N.n)       (N.n)    (N.n)    (N.n)    (N.n)    (N.n)
=================================================================
month           9            8       11       12       10        9
SS_All         N.n          N.n      N.n      N.n      N.n      N.n
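SimpleKMeans follows the standard Lloyd's k-means loop: assign each
instance to its nearest centroid, then move each centroid to the mean
of its cluster, and repeat. A minimal one-attribute sketch of that
loop (hypothetical `kmeans_1d`; this is the concept, not Weka's
implementation, which also handles nominal attributes and seeding):

```python
# Minimal 1-D Lloyd's k-means sketch of what SimpleKMeans does.
def kmeans_1d(points, centroids, iters=20):
    """Iteratively assign points to nearest centroid, then recompute means."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)
```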
Q10: Load CSC458assn5.arff.gz and run an AddExpression that creates a
derived attribute wnd_NW_WNW that is the sum of wndNW and wndWNW. We
are doing this because Hawk Mountain volunteers started using 3-letter
wind direction counts only in 1995, causing wndNW counts to decrease
and wndWNW counts to increase from 0. Analysis that we will go over in
class revealed that most of the decrease in the wndNW count went into
wndWNW, not wndNNW. Use the Reorder filter to make SS_All the final
attribute again. Temporarily remove all attributes except wnd_NW_WNW
and SS_All, go into Cluster -> SimpleKMeans, set the numClusters
configuration parameter to 5, run it, and paste this table. Ignore the
Full Data column. Can you see a general correlation between wnd_NW_WNW
and SS_All? Is there an exception?

Final cluster centroids:
                        Cluster#
Attribute    Full Data       0        1        2        3        4
               (N.n)       (N.n)    (N.n)    (N.n)    (N.n)    (N.n)
=================================================================
wnd_NW_WNW      N.n         N.n      N.n      N.n      N.n      N.n
SS_All          N.n         N.n      N.n      N.n      N.n      N.n
Q11: Load CSC458assn5.arff.gz and delete year, month, and all
SS_All-derived attributes, keeping SS_All itself as the target. Run
Cluster -> SimpleKMeans with numClusters=5 and visually inspect the
table to see if you can find any non-target attribute that increases
or decreases monotonically with SS_All, i.e., not changing direction
as SS_All decreases. Keep the full table in Weka but just paste the
attribute here if you can find one. If you can't find one, paste the
closest attribute you can find. I will award 5 bonus points if you
can find one.
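The monotonicity test behind this question (the idea in the added
Python script, though not the script itself) can be sketched as: order
a candidate attribute's five cluster centroids by their SS_All
centroids, then check that the sequence moves in one direction only.
The helper names below are hypothetical.

```python
# Sketch: does a candidate attribute move monotonically with SS_All
# across the cluster centroids?
def is_monotonic(seq):
    """True if seq is entirely non-decreasing or entirely non-increasing."""
    increasing = all(a <= b for a, b in zip(seq, seq[1:]))
    decreasing = all(a >= b for a, b in zip(seq, seq[1:]))
    return increasing or decreasing

def monotonic_with_target(attr_vals, target_vals):
    """Order the attribute's centroids by the target's centroids, then test."""
    ordered = [a for _, a in sorted(zip(target_vals, attr_vals))]
    return is_monotonic(ordered)
```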
ANSWER BELOW HERE:

Final cluster centroids:
                           Cluster#
Attribute        Full Data       0        1        2        3        4
                   (N.n)       (N.n)    (N.n)    (N.n)    (N.n)    (N.n)
===============================================================================
???attribute???     N.n         N.n      N.n      N.n      N.n      N.n
SS_All              N.n         N.n      N.n      N.n      N.n      N.n
----------------------------------------------------------------
2g. Reread all questions and make sure you have answered every part,
such as the Landis & Koch categories for kappa and the
result-to-result comparisons.