ADDED a Python script to analyze monotonic cluster sequences for Q11 in this zip file.

We will go over it on May 8 in class.

Assignment 5 due 11:59 PM Thursday May 2 via

Download compressed ARFF data file

The ARFF raptor data are the same as for Assignment 3. We are going to compare new-to-us models to those of Assignment 3.

You must answer questions in

Start Weka, bring up the Explorer GUI, and

Set

Alternatively you can

This ARFF file has 45 attributes (columns) and 226 instances (rows) of monthly aggregate data from August through December of 1976 through 2021.

Here are the attributes in the file. It is a monthly aggregate of daily aggregates of (mostly 1-hour) observation periods.

year 1976-2021

month 8-12

HMtempC_mean mean for month of temp Celsius during observation times

WindSpd_mean same for wind speed in km/hour

HMtempC_median median for month

WindSpd_median

HMtempC_pstdv population standard deviation

WindSpd_pstdv

HMtempC_min minimum & maximum

WindSpd_min

HMtempC_max

WindSpd_max

wndN tally of North winds for all observations in the month, etc.

wndNNE

wndNE

wndENE

wndE

wndESE

wndSE

wndSSE

wndS

wndSSW

wndSW

wndWSW

wndW

wndWNW

wndNW

wndNNW

wndUNK

HMtempC_24_mean Changes in magnitude (absolute value of change) over 24, 48, and 72 hours

HMtempC_48_mean

HMtempC_72_mean

HMtempC_24_median

HMtempC_48_median

HMtempC_72_median

HMtempC_24_pstdv

HMtempC_48_pstdv

HMtempC_72_pstdv

HMtempC_24_min The min & max are their signed values.

HMtempC_48_min

HMtempC_72_min

HMtempC_24_max

HMtempC_48_max

HMtempC_72_max

SS_All Tally of sharp-shinned hawk observations during each month 8-12, 1976-2021. Target attribute.

You can examine its contents and sort based on attributes by opening the file in the Preprocess

Enter

Be careful

Enter

This step compresses SS_All.

The reason for adding +1 to aN is to avoid taking log(0), which is undefined, for SS_All counts of 0. None of the resulting values will be negative.
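The effect of this log10(count + 1) compression can be sketched in Python (hypothetical SS_All counts; Weka's filter does the equivalent per instance):

```python
import math

# Hypothetical monthly SS_All counts, including a zero:
ss_all = [0, 3, 120, 1500]

# log10(count + 1) is defined at count == 0 and is never negative,
# because count + 1 >= 1 implies log10(count + 1) >= 0.
ss_all_log10 = [math.log10(n + 1) for n in ss_all]

print(ss_all_log10)  # first value is exactly 0.0, the rest are positive
```

Note how the transform pulls the large counts close together, which is why the handout calls it "compressing" SS_All.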

Enter

Then Discretize SS_All_Log10_Range10 using the same steps.

SS_All_Range10

Save this dataset as

parameters ("meta-parameters") and fill in these numbers N and N.N

below by sweeping the output with your mouse, hitting control-C to copy,

and then pasting into README.assn5.txt, just like Assignment 3. Make

sure to include "using N nearest neighbor(s)" from Weka's report.

IB1 instance-based classifier

using

...

Correlation coefficient N.N

Mean absolute error N.N

Root mean squared error N.N

Relative absolute error N %

Root relative squared error N %

Total Number of Instances 226

you have the

until you have the maximum correlation coefficient (CC). I did this two ways. First, I incremented

The first peak may be exceeded later. I will talk about "hill climbing" algorithms in class.

Second, I tried binary search, where I set

highest

I recommend the first approach. An incorrect peak

will lose points.
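The recommended incremental approach is a simple hill climb, sketched below in Python. Here cc_for_k is a hypothetical stand-in for running Weka's IBk with a given K and reading off the cross-validated CC; the patience parameter keeps you searching past the first peak, since a later K may exceed it:

```python
def best_k_by_hill_climb(cc_for_k, k_start=1, k_max=50, patience=10):
    """Increment K one step at a time, tracking the best correlation
    coefficient (CC) seen; stop after `patience` consecutive steps
    without improvement, or at k_max."""
    best_k, best_cc = k_start, cc_for_k(k_start)
    misses = 0
    k = k_start
    while k < k_max and misses < patience:
        k += 1
        cc = cc_for_k(k)
        if cc > best_cc:
            best_k, best_cc, misses = k, cc, 0
        else:
            misses += 1
    return best_k, best_cc

# Toy single-peak CC curve with its maximum at K = 7 (made-up numbers):
toy_cc = lambda k: 0.8 - 0.01 * (k - 7) ** 2
print(best_k_by_hill_climb(toy_cc))  # (7, 0.8)
```

With a multi-peak curve the patience window is what lets the climb escape an early local maximum.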

in the handout, paste the following Weka output, again including

"using N nearest neighbor(s)"

IB1 instance-based classifier

using

...

Correlation coefficient N.N

Mean absolute error N.N

Root mean squared error N.N

Relative absolute error N %

Root relative squared error N %

Total Number of Instances 226

and run

IB1 instance-based classifier

using

...

Correlation coefficient N.N

Mean absolute error N.N

Root mean squared error N.N

Relative absolute error N %

Root relative squared error N %

Total Number of Instances 226

1 at a time, not more than 10 consecutive times. Do the same

for Q2's KNN but incrementing it 1 at a time for no more than

10 times. What is the KNN now in terms of peak CC value?

Did KNN (number of nearest neighbor training instances) increase

or decrease from Q2's value?

IB1 instance-based classifier

using

...

Correlation coefficient N.N

Mean absolute error N.N

Root mean squared error N.N

Relative absolute error N %

Root relative squared error N %

Total Number of Instances 226

KNN value against SS_All_Range10 and then SS_All_Log10_Range10

in turn, making sure that no other SS_All-derived values are in

the data when classifying. Record these values. How does the

kappa in each of them rate in terms of Landis & Koch categories?

https://faculty.kutztown.edu/parson/fall2019/Fall2019Kappa.html
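The Landis & Koch lookup from the kappa page above can be sketched as a small Python function:

```python
def landis_koch(kappa):
    """Map a kappa statistic to its Landis & Koch (1977)
    agreement category."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch(0.55))  # moderate
```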

IB1 instance-based classifier

using N nearest neighbour(s) for classification

...

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

IB1 instance-based classifier

using N nearest neighbour(s) for classification

...

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

as the target attribute and no other SS_All-derived attributes

in the dataset. Run classifier

data and record these results. How does it compare to Q5's

kappa result? What is its Landis & Koch kappa range?

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

IBk (KNN) is sensitive to less-important non-target attributes contributing equally with important ones to the distance metric between training instances. NaiveBayes is sensitive to partially redundant attributes skewing predictions

as outlined in the March 6 Zoom recording of class. We would like to eliminate both.
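A tiny sketch of why equal weighting hurts KNN, using made-up two-attribute instances where only the first attribute matters:

```python
import math

def euclid(a, b):
    """Plain Euclidean distance, weighting all attributes equally."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# a and b agree on the important attribute (first value) but differ
# on an irrelevant, noisy one (second value); c differs on the
# important attribute but agrees on the irrelevant one.
a = [1.0, 0.0]
b = [1.0, 9.0]
c = [5.0, 0.0]

# The irrelevant attribute makes b look farther from a than c is,
# so a KNN search would pick the wrong neighbor.
print(euclid(a, b) > euclid(a, c))  # True
```

Removing or down-weighting such attributes is exactly what the attribute-selection step below is for.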

accept the Ranker pop-up for Search Method. Run this and list the top 12 in

descending order of CC:

Attribute Evaluator (supervised, Class (nominal): 45

Correlation Ranking Filter

Ranked attributes:

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?

N.n N ?
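Weka's correlation ranking evaluates each attribute by its Pearson correlation with the class. A rough stand-in, with made-up columns, can be sketched as:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def rank_by_correlation(columns, target, top=12):
    """Rank attributes by |Pearson correlation| with the target,
    descending -- a rough stand-in for Weka's correlation ranking."""
    scored = [(abs(pearson(vals, target)), name)
              for name, vals in columns.items()]
    return sorted(scored, reverse=True)[:top]

# Tiny hypothetical dataset: one perfectly correlated column, one not.
cols = {"wndNW": [1, 2, 3, 4], "year": [1976, 2000, 1980, 1990]}
target = [2, 4, 6, 8]
print(rank_by_correlation(cols, target, top=2))
```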

and the

leaving 13 attributes. Run I

and

Which of these 3 does the best in terms of kappa? Have IBk and

NaiveBayes improved or degraded in terms of kappa from their Q5

and Q6 counterparts respectively?

IBk:

IB1 instance-based classifier

using N nearest neighbour(s) for classification

...

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

NaiveBayes:

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

OneR:

=== Classifier model (full training set) ===

RULE GOES HERE

Correctly Classified Instances N N.n %

Incorrectly Classified Instances N N.n %

Kappa statistic N.n

Mean absolute error N.n

Root mean squared error N.n

Relative absolute error N.n %

Root relative squared error N.n %

Total Number of Instances 226

Run unsupervised attribute filter

in order to make it a discrete value during clustering. Verify

that it has five discrete nominal values {8, 9, 10, 11, 12}.
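A quick sanity check of those expected nominal values, sketched in Python with a hypothetical month column (Weka's filter performs the actual numeric-to-nominal conversion):

```python
# Hypothetical month column drawn from the August-December data:
months = [8, 9, 10, 11, 12, 8, 9, 10]

# The distinct values, as nominal (string) labels:
nominal = [str(m) for m in sorted(set(months))]
print(nominal)  # ['8', '9', '10', '11', '12']
```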

Temporarily remove all attributes except

After applying this filter, go into

How does it relate to Figure 6 in Assignment 3's handout? Ignore the

Full Data column.

Final cluster centroids:

Cluster#

Attribute Full Data 0 1 2 3 4

(N.n) (N.n) (N.n) (N.n) (N.n) (N.n)

=======================================================================

month 9 8 11 12 10 9

SS_All N.n N.n N.n N.n N.n N.n

creates a derived attribute

wnd_NW and wnd_WNW. We are doing this because Hawk Mountain

volunteers started using 3-letter wind direction counts only in

1995, causing wnd_NW counts to decrease and wnd_WNW counts to

increase from 0. Analysis that we will go over in class revealed

that most of the decrease in the wnd_NW count went into wnd_WNW, not

wnd_NNW. Use the

again. Temporarily remove all attributes except wnd_NW_WNW and SS_All,

Go into

parameter to

Can you see a general correlation between wnd_NW_WNW and SS_All?

Is there an exception?
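The merge of the two wind tallies can be sketched in plain Python (hypothetical column values spanning the 1995 protocol change; Weka's derived-attribute filter does the equivalent row by row):

```python
# Hypothetical monthly tallies before and after the 1995 change:
wnd_NW  = [120, 115, 40, 35]   # drops after 1995
wnd_WNW = [0,   0,   75, 80]   # rises from 0 after 1995

# Summing the two restores a tally comparable across the whole
# 1976-2021 span, since most of the lost NW counts went to WNW.
wnd_NW_WNW = [nw + wnw for nw, wnw in zip(wnd_NW, wnd_WNW)]
print(wnd_NW_WNW)  # [120, 115, 115, 115]
```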

Final cluster centroids:

Cluster#

Attribute Full Data 0 1 2 3 4

(N.n) (N.n) (N.n) (N.n) (N.n) (N.n)

Cluster ->

table to see if you can find any non-target attribute that increases

or decreases monotonically with SS_All, i.e., not changing direction

as SS_All decreases. Keep the full table in Weka but just paste

the attribute here if you can find one. If you can't find one,

paste the closest attribute you can find. I will award 5 bonus

points if you can find one.
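A monotonicity test like the one in the posted Python script can be sketched as follows, assuming the centroid values for one attribute have already been extracted from Weka's output with the clusters ordered by SS_All:

```python
def is_monotonic(values):
    """True if the sequence never changes direction, i.e. it is
    entirely non-increasing or entirely non-decreasing."""
    pairs = list(zip(values, values[1:]))
    return (all(a <= b for a, b in pairs) or
            all(a >= b for a, b in pairs))

# Hypothetical centroid values, clusters already ordered by SS_All:
print(is_monotonic([3.1, 2.5, 2.5, 0.9]))  # True (non-increasing)
print(is_monotonic([3.1, 0.9, 2.5]))       # False (changes direction)
```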

ANSWER BELOW HERE:

Final cluster centroids:

Cluster#

Attribute Full Data 0 1 2 3 4

(N.n) (N.n) (N.n) (N.n) (N.n) (N.n)

===============================================================================

???attribute??? N.n N.n N.n N.n N.n N.n

SS_All N.n N.n N.n N.n N.n N.n

----------------------------------------------------------------

2g. Reread all questions and make sure you have answered all questions such

as Landis & Koch categories for kappa and result-to-result comparisons.