CSC 458 - Data Mining & Predictive Analytics I, Fall 2022, Assignment 2 on Numeric Regression.
Assignment 2 is due 11:59 PM Friday, October 14, via D2L. You can do this
assignment on any machine with Weka installed.
October 10 is a KU holiday, and I will be on vacation on the 11th and 13th.
I will hold Zoom office hours Wednesday the 12th.
The 11th follows Monday's class schedule. I will post a Zoom video
for the October 13 class ahead of time.
The September 27 class will walk through a similar Weka session
to this one. I will post the video on the course page.
1. To get the assignment:
Download compressed ARFF data file day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz
and Q&A file README.assn2.txt from these
links.
You must answer questions in README.assn2.txt and save
& later turn in working files CSC458assn2.arff.gz,
train1.arff.gz, test1.arff.gz, train2.arff.gz, and
test2.arff.gz.
Each answer for Q1 through Q15 in README.assn2.txt is worth 5
points, as is each of CSC458assn2.arff.gz,
train1.arff.gz, test1.arff.gz, train2.arff.gz, and
test2.arff.gz with correct contents, totalling
100%. There is a 10% late penalty for each day the assignment is
late.
2. Weka and README.assn2.txt operations
Start Weka, bring up the Explorer GUI, and open
day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz.
Set Files of Type at the bottom of the
Open window to (*.arff.gz) to see the input ARFF file. Double
click it.
This ARFF file has 199 attributes (columns) and 5642 instances
(rows) of daily aggregate data from August through December of
1976 through 2021.
The attributes starting with Hourly, noaa, and
dry are from the Allentown Airport. They are there to augment Hawk
Mountain data and to study climate change.
Hawk Mountain currently observes raptor counts
and weather properties hourly during the day from August 15
through December 15.
There are some early & late day
observations. Most attributes are numeric, with a few date
attributes. Here are some examples from the ARFF file.
@relation 'day_aggregate_HMS'
@attribute datetime date 'yyyy-MM-dd HH:mm:ss'
@attribute year numeric
@attribute yearSince1976 numeric
@attribute month numeric
@attribute monthday numeric
@attribute yearday numeric
@attribute daySinceAug1 numeric
@attribute duration numeric
Comment lines, when present, start with a "%" character. There are
up to 4 data types: date (can include time), numeric, string, and
nominal.
We will see nominal (set-valued) attributes in
Assignment 3 when we do Classification. This assignment is about
Numeric Regression.
Explore the attributes by walking through them on the left side of the
Weka Preprocess tab and examining their statistical distributions on the
right side.
Although this file is compressed, you can examine its contents, and sort
instances by attribute values, by clicking the Edit button in the
Preprocess tab.
2a. In the Preprocess tab delete all but the following 22
attributes.
@attribute year numeric
@attribute month numeric
@attribute monthday numeric
@attribute HMtempC numeric
@attribute WindSpd numeric
@attribute wndN numeric
@attribute wndNNE numeric
@attribute wndNE numeric
@attribute wndENE numeric
@attribute wndE numeric
@attribute wndESE numeric
@attribute wndSE numeric
@attribute wndSSE numeric
@attribute wndS numeric
@attribute wndSSW numeric
@attribute wndSW numeric
@attribute wndWSW numeric
@attribute wndW numeric
@attribute wndWNW numeric
@attribute wndNW numeric
@attribute wndNNW numeric
@attribute wndUNK numeric
You can delete the remaining attributes in one of three ways.
2a1. Check the check boxes of all attributes you want to
delete and then click the Remove button at the bottom.
2a2. Check the check boxes of all attributes you want to keep, hit the
Invert button to reverse the selections, and then click the Remove button
at the bottom.
2a3. Choose Filter -> unsupervised ->
attribute -> Remove, set the comma-separated and dashed-range
attribute numbers, then click Apply.
Option 2a3 is useful when
you have multiple ranges of attributes to Remove.
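If you prefer to script the attribute-removal step of 2a rather than clicking
through the GUI, here is a minimal sketch using the Weka Java API. The index
list shown is hypothetical; replace it with the 1-based positions of the 22
attributes you are keeping in your copy of the file.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class KeepAttributes {
        public static void main(String[] args) throws Exception {
            // Weka's ARFF loader reads gzip-compressed .arff.gz files directly.
            Instances data =
                DataSource.read("day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz");

            Remove remove = new Remove();
            remove.setAttributeIndices("2,4,5,10-28");  // HYPOTHETICAL positions of the 22 kept attributes
            remove.setInvertSelection(true);            // keep the listed attributes, remove the rest
            remove.setInputFormat(data);
            Instances kept = Filter.useFilter(data, remove);

            System.out.println("Remaining attributes: " + kept.numAttributes());
        }
    }

This mirrors options 2a2/2a3: list what you want to keep and invert the
selection before removing.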
2b. Select Filter -> unsupervised -> attribute ->
Reorder to make WindSpd the final attribute at the bottom
without reordering the others.
We are doing this step because we plan to
correlate the remaining non-target attributes to WindSpd as the
target attribute.
Click Apply to reorder the
attributes and check the result.
When it is correct, Save it as CSC458assn2.arff.gz
AFTER setting Files of Type to (*.arff.gz). You will
turn this file in to D2L.
Whenever you Open or Save an ARFF file,
set Files of Type to (*.arff.gz)
AND make sure the file name ends in .arff.gz.
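A scripted sketch of the 2b reorder-and-save step follows, under the same
caveats: the index string assumes WindSpd happens to be attribute 5 of the 22
(adjust to your file), the input file name is a hypothetical intermediate copy
of the 22 kept attributes, and setCompressOutput is assumed to correspond to
the ArffSaver's -compress option for writing .arff.gz.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Reorder;

    public class ReorderAndSave {
        public static void main(String[] args) throws Exception {
            // HYPOTHETICAL intermediate file holding the 22 kept attributes from 2a.
            Instances data = DataSource.read("CSC458assn2_22attrs.arff.gz");

            // Move WindSpd (assumed here to be attribute 5) to the end without
            // disturbing the order of the other attributes.
            Reorder reorder = new Reorder();
            reorder.setAttributeIndices("1-4,6-22,5");  // HYPOTHETICAL: adjust to WindSpd's actual position
            reorder.setInputFormat(data);
            Instances reordered = Filter.useFilter(data, reorder);

            // Save as gzip-compressed ARFF for submission.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(reordered);
            saver.setFile(new File("CSC458assn2.arff.gz"));
            saver.setCompressOutput(true);   // assumed -compress equivalent, writes .arff.gz
            saver.writeBatch();
        }
    }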
I will go over the Weka GUI steps on 9/27.
You can watch the posted video to review.
In general, be careful with editing data. An
erroneous edit can change analysis results substantially.
2c / Q1. Go to the Classify tab, select function ->
SimpleLinearRegression, hit Start, and examine the output.
Copy from Weka (use control-C, not a menu) and paste into README.assn2.txt
Q1 ONLY the following results:
Linear regression on ???
???
Predicting ??? if attribute value is missing.
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                  ??? %
Root relative squared error              ??? %
Total Number of Instances                ???
Ignored Class Unknown Instances          ???
The ???s above are for result data supplied by Weka. Paste only
those lines with the actual results.
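The 2c run can also be driven from the Weka Java API instead of the Classify
tab. A minimal sketch, assuming WindSpd is the last attribute and using 10-fold
cross-validation (the default test option in the Classify tab):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SimpleLinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunSimpleLinearRegression {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("CSC458assn2.arff.gz");
            data.setClassIndex(data.numAttributes() - 1);  // target WindSpd is the last attribute

            // Build on all data to get the "Linear regression on ..." formula.
            SimpleLinearRegression slr = new SimpleLinearRegression();
            slr.buildClassifier(data);
            System.out.println(slr);

            // 10-fold cross-validation produces the CC, MAE, RMSE, RAE, RRSE summary.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new SimpleLinearRegression(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }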
README.assn2.txt Q2: Based on the attribute identified by
the "Linear regression on ???" output from Weka, which attribute
most closely correlates with WindSpd?
Why does this attribute correlate closely with WindSpd? (Hints:
Does it have a positive or negative multiplier in the regression
formula? Positive indicates positive correlation and negative
indicates negative. You can consult the current Hawk
Mountain Study, Section 2 and search for this
attribute name.)
2d / Q3. Go to the Classify tab, select function ->
LinearRegression, hit Start, and examine the output.
Copy from Weka (use control-C, not a menu) and paste into README.assn2.txt
Q3 ONLY the following results:
WindSpd =
??? (Paste full linear expression here.)
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                  ??? %
Root relative squared error              ??? %
Total Number of Instances                ???
Ignored Class Unknown Instances          ???
README.assn2.txt Q4: Which attribute in the full linear
expression has the strongest correlation, positive or negative,
with WindSpd? Ignore the sign and just use the absolute value of
the multiplier in deciding your answer. How does this
relate to your answers in Q2?
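The same sketch carries over to the full multivariate model of 2d; only the
classifier class changes. Again, this is a hedged sketch of one way to script
it, not the required method:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunLinearRegression {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("CSC458assn2.arff.gz");
            data.setClassIndex(data.numAttributes() - 1);  // target WindSpd

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr);   // prints the full "WindSpd = ..." linear expression

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }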
2e. In the Preprocess tab, APPLY Filter
-> unsupervised -> attribute -> Normalize to
place all of the attributes except target WindSpd on the same
scale [0.0, 1.0] by computing, for each AttributeValue in a row:
((AttributeValue - min(AttributeValuesForThatColumn)) /
 (max(AttributeValuesForThatColumn) - min(AttributeValuesForThatColumn)))
We are normalizing in order to put the non-target attributes on a common
scale, so that the magnitudes of their multiplier weights in the regression
formula reflect their actual correlation with the target. Unnormalized
multipliers are in application-domain units that may vary widely by
attribute. Inspect the attributes in the Preprocess tab to ensure that all
are in the range [0.0, 1.0] except the target attribute, which keeps its
original range.
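For example, a column whose minimum is 2 and maximum is 10 maps the value 6
to (6 - 2) / (10 - 2) = 0.5. If you want to script this step, a minimal
sketch with the Weka Java API follows; setting the class index before
applying the filter is what leaves target WindSpd unnormalized, mirroring
what the Preprocess tab does when a class attribute is selected.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class NormalizeNonTarget {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("CSC458assn2.arff.gz");
            data.setClassIndex(data.numAttributes() - 1);  // WindSpd; Normalize skips the class attribute

            Normalize norm = new Normalize();   // defaults: scale 1.0, translation 0.0 -> range [0.0, 1.0]
            norm.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, norm);

            // Spot-check the first non-target attribute's new min/max.
            System.out.println(normalized.attributeStats(0).numericStats);
        }
    }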
README.assn2.txt Q5:
Go to the Classify tab, select function -> LinearRegression
again, this time with normalized non-target attributes, hit
Start, and examine the output. Copy from Weka and paste into
README.assn2.txt Q5 ONLY the following results:
WindSpd =
??? (Paste full linear expression here.)
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                  ??? %
Root relative squared error              ??? %
Total Number of Instances                ???
Ignored Class Unknown Instances          ???
Q6: What are the top TWO non-target attributes of Q5 in
terms of absolute multiplier magnitude? Has normalization
changed the identity of the strongest-correlation attribute? Has
normalization changed the identity of the second-strongest attribute?
Give the names of the strongest and second-strongest correlated
attributes from Q5 and then those from Q3. What accounts for any
changes?
Q7: Has Correlation coefficient or any of the error
measures changed significantly (more than 5%, or at all) in
going from Q3 to Q5?
2f
/ Q8. Go to the Classify tab, select tree -> M5P,
hit Start, and examine the output. The non-target attributes
are still normalized.
Copy from Weka and paste into README.assn2.txt Q8
ONLY the following results:
M5 pruned model tree:
(using smoothed linear models)
??? <= ???.045 : LM1 (3567/97.151%)
??? > ???.045 : LM2 (1850/65.199%)
LM num: 1
WindSpd = (Paste full linear expression here.)
LM num: 2
WindSpd = (Paste full linear expression here.)
Number of Rules : 2
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                  ??? %
Root relative squared error              ??? %
Total Number of Instances                ???
Q9: In terms of the M5P decision tree that uses (<,
<=, >, or >=) operators, which attribute is the most
important in selecting a linear expression to run? How does this
agree or disagree with your earlier analyses above?
Q10: How does the multiplier weight of Q9's attribute in
formula LM num: 1 compare with its weight in LM num: 2? What might account
for the reduction in importance in one of these expressions?
(Hint: What does the decision tree accomplish?)
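The 2f M5P run can be scripted the same way; M5P is in the
weka.classifiers.trees package. A minimal sketch, assuming the normalized
data from 2e has been saved to a hypothetical file and WindSpd is still the
last attribute:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunM5P {
        public static void main(String[] args) throws Exception {
            // HYPOTHETICAL file name: a saved copy of the normalized data from 2e.
            Instances data = DataSource.read("CSC458assn2_normalized.arff.gz");
            data.setClassIndex(data.numAttributes() - 1);

            M5P m5p = new M5P();        // defaults: pruned model tree with smoothed linear models
            m5p.buildClassifier(data);
            System.out.println(m5p);    // the decision tree plus the LM1 and LM2 expressions

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new M5P(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }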
2g: In the Preprocess tab, load your CSC458assn2.arff.gz
with unnormalized (original) values into Weka. Run Filter ->
unsupervised -> instance -> RemovePercentage with
the default configuration parameter of 50% and invertSelection
= False, Apply, and verify that only 2821 instances remain.
Save this as train1.arff.gz. Use the Edit window to see
that instances are in ascending order by (year, month, monthday)
starting in August 2000.
Hit Undo once and verify that 5642 instances are back,
change the RemovePercentage config parameter invertSelection
to True, click OK, and Apply this filter, verifying 2821
instances. Save this as test1.arff.gz.
Use the Edit window to see that instances are in ascending
order by (year, month, monthday) starting in August 1976.
Load train1.arff.gz into Weka to train models, and
use the Edit window to see that instances are in
ascending order by (year, month, monthday) starting in
August 2000.
Go to the Classify tab and change the test option to "Supplied
test set", clicking the Set button to select
test1.arff.gz as the test dataset. We are
training a model on train1.arff.gz (starting in year
2000) and testing on test1.arff.gz (starting in year
1976).
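A scripted sketch of this 2g split-and-evaluate workflow, under the same
assumptions as the earlier sketches (WindSpd last; file names as in the
assignment). With invertSelection = False, RemovePercentage drops the first
50% of instances, so what remains is the later-years half; inverting the
selection keeps the earlier-years half instead.

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class SplitAndEvaluate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("CSC458assn2.arff.gz");  // unnormalized, original values

            // Drop the first 50% of instances; the remaining later-years half is train1.
            RemovePercentage split = new RemovePercentage();
            split.setPercentage(50.0);
            split.setInvertSelection(false);
            split.setInputFormat(data);
            Instances train1 = Filter.useFilter(data, split);

            // Invert the selection to keep the first (earlier-years) half as test1.
            split = new RemovePercentage();
            split.setPercentage(50.0);
            split.setInvertSelection(true);
            split.setInputFormat(data);
            Instances test1 = Filter.useFilter(data, split);

            train1.setClassIndex(train1.numAttributes() - 1);  // target WindSpd
            test1.setClassIndex(test1.numAttributes() - 1);

            // Train on train1, evaluate on the supplied test set test1 (Q11).
            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(train1);
            Evaluation eval = new Evaluation(train1);
            eval.evaluateModel(lr, test1);
            System.out.printf("CC=%.4f  MAE=%.4f  RMSE=%.4f%n",
                    eval.correlationCoefficient(),
                    eval.meanAbsoluteError(),
                    eval.rootMeanSquaredError());
        }
    }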
Q11: In Classify run LinearRegression and record only these
results.
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Q12: How do the Correlation coefficient (CC), Mean
absolute error (MAE), and Root mean squared error (RMSE) of Q11
compare to those of LinearRegression in Q3 and Q5? Did they get
better, worse, or stay the same? List the actual values like this,
putting in the values for CC, MAE, and RMSE:
Q3 CC MAE RMSE
Q5 CC MAE RMSE
Q11 CC MAE RMSE
2h: In the Preprocess tab, load your CSC458assn2.arff.gz with
unnormalized (original) values into Weka. Run Filter -> unsupervised ->
instance -> Randomize with
its default seed of 42 (make sure to Apply Randomize ONLY
ONCE) before running Filter -> unsupervised ->
instance -> RemovePercentage with the default
configuration parameter of 50% and invertSelection
= False, Apply, and verify that only 2821 instances
remain. Save this as train2.arff.gz. Use the Edit
window to see that instances are in random order.
Hit Undo once and verify 5642 instances are back,
change the RemovePercentage config parameter invertSelection
to True, click OK, and Apply this filter, verifying 2821
instances. Save this as test2.arff.gz.
Use the Edit window to see that instances are in random
order.
Load train2.arff.gz into Weka to train models.
Go to the Classify tab and change the test option to "Supplied
test set", clicking the Set button to select test2.arff.gz
as the test dataset. We are training a model on
train2.arff.gz and testing on test2.arff.gz. Both are in
randomized order and hold different, disjoint halves of the shuffled
instances from CSC458assn2.arff.gz.
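A corresponding sketch for the 2h randomized split; the only difference from
the 2g sketch is applying Randomize (once, with its default seed of 42)
before RemovePercentage:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.Randomize;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class RandomizedSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("CSC458assn2.arff.gz");  // unnormalized, original values

            // Shuffle ONCE with the filter's default seed of 42.
            Randomize shuffle = new Randomize();
            shuffle.setRandomSeed(42);
            shuffle.setInputFormat(data);
            Instances shuffled = Filter.useFilter(data, shuffle);

            // Same 50% split as in 2g, but over the shuffled instances.
            RemovePercentage split = new RemovePercentage();
            split.setPercentage(50.0);
            split.setInvertSelection(false);   // remaining half becomes train2
            split.setInputFormat(shuffled);
            Instances train2 = Filter.useFilter(shuffled, split);

            split = new RemovePercentage();
            split.setPercentage(50.0);
            split.setInvertSelection(true);    // kept half becomes test2
            split.setInputFormat(shuffled);
            Instances test2 = Filter.useFilter(shuffled, split);

            System.out.println(train2.numInstances() + " training and "
                    + test2.numInstances() + " test instances");
        }
    }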
Q13: In Classify run LinearRegression and record only
these results.
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Q14: How do the Correlation coefficient (CC), Mean
absolute error (MAE), and Root mean squared error (RMSE) of Q13
compare to those of LinearRegression in Q11? Did they get
better, worse, or stay the same? List the actual values like
this, putting in the values for CC, MAE, and RMSE:
Q11 CC MAE RMSE
Q13 CC MAE RMSE
Q15: What accounts for the changes in these measures in
going from Q11 to Q13?