CSC 458 - Data Mining & Predictive Analytics I, Fall 2022, Assignment 2 on Numeric Regression.

Assignment 2 due 11:59 PM Friday October 14 via D2L Assignment 2. You can do this on any machine with Weka installed.
Oct 10 is a KU holiday and I will be on vacation the 11th & 13th. I will have Zoom office hours Wed the 12th.
The 11th is Monday's scheduled class. I will post a Zoom video for the October 13 class ahead of time.

The September 27 class will walk through a similar Weka session to this one. I will post the video on the course page.

1. To get the assignment:
Download compressed ARFF data file day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz and Q&A file README.assn2.txt from these links.
You must answer questions in README.assn2.txt and save & later turn in working files CSC458assn2.arff.gz, train1.arff.gz, test1.arff.gz, train2.arff.gz, and test2.arff.gz.
Each answer for Q1 through Q15 in README.assn2.txt is worth 5 points, as is each of
CSC458assn2.arff.gz, train1.arff.gz, test1.arff.gz, train2.arff.gz, and test2.arff.gz with correct contents, totalling 100%. There is a 10% late penalty for each day the assignment is late.

2. Weka and README.assn2.txt operations
Start Weka, bring up the Explorer GUI, and open day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz.
    Set Files of Type at the bottom of the Open window to (*.arff.gz) to see the input ARFF file. Double click it.

This ARFF file has 199 attributes (columns) and 5642 instances (rows) of daily aggregate data from August through December of 1976 through 2021.
    The attributes starting with Hourly, noaa, and dry are from the Allentown Airport. They are there to augment Hawk Mountain data and to study climate change.
    Hawk Mountain currently observes raptor counts and weather properties hourly during the day from August 15 through December 15.
    There are some early & late day observations. Most attributes are numeric, with a few date attributes. Here are some examples from the ARFF file.

@relation 'day_aggregate_HMS'
@attribute datetime date 'yyyy-MM-dd HH:mm:ss'
@attribute year numeric
@attribute yearSince1976 numeric
@attribute month numeric
@attribute monthday numeric
@attribute yearday numeric
@attribute daySinceAug1 numeric
@attribute duration numeric

Comment lines, when present, start with a "%" character. There are up to 4 data types: date (can include time), numeric, string, and nominal.
    We will see nominal (set-valued) attributes in Assignment 3 when we do Classification. This assignment is about Numeric Regression.
    Explore the attributes by looking at their statistical distributions on the right side and walking through them on the left side of the Preprocess Weka tab.
    This file is compressed. You can examine its contents and sort based on attributes by clicking the Preprocess Edit button.
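If you want to inspect the header outside of Weka, the following is a minimal Python sketch of reading the @attribute lines from a gzip-compressed ARFF file. The function name read_arff_attributes and the tiny in-memory sample are illustrative assumptions, not part of the assignment, and the parser assumes attribute names contain no spaces or quotes.

```python
import gzip
import io

def read_arff_attributes(fileobj):
    """Return (name, type) pairs from the @attribute lines of an ARFF header."""
    attributes = []
    for raw in fileobj:
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank and comment lines
            continue
        lower = line.lower()
        if lower.startswith("@data"):          # the header ends where data begins
            break
        if lower.startswith("@attribute"):
            # Simplification: assumes names have no embedded spaces or quotes.
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
    return attributes

# Tiny in-memory stand-in for a compressed ARFF file (NOT the assignment data).
sample = b"""% comment line
@relation 'demo'
@attribute datetime date 'yyyy-MM-dd HH:mm:ss'
@attribute year numeric
@attribute WindSpd numeric
@data
'1976-08-15 00:00:00',1976,12.5
"""
compressed = gzip.compress(sample)
with gzip.open(io.BytesIO(compressed), "rt") as f:
    attrs = read_arff_attributes(f)
print(attrs)
```

To try it on the real file, pass gzip.open("day_aggregate_HMS_1976_2021_kupapcsit01.arff.gz", "rt") instead of the in-memory sample.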

2a. In the Preprocess tab delete all but the following 22 attributes.

@attribute year numeric
@attribute month numeric
@attribute monthday numeric
@attribute HMtempC numeric
@attribute WindSpd numeric
@attribute wndN numeric
@attribute wndNNE numeric
@attribute wndNE numeric
@attribute wndENE numeric
@attribute wndE numeric
@attribute wndESE numeric
@attribute wndSE numeric
@attribute wndSSE numeric
@attribute wndS numeric
@attribute wndSSW numeric
@attribute wndSW numeric
@attribute wndWSW numeric
@attribute wndW numeric
@attribute wndWNW numeric
@attribute wndNW numeric
@attribute wndNNW numeric
@attribute wndUNK numeric

You can delete the remaining attributes in one of three ways.
    2a1. Check in check boxes all you want to delete and then click the Remove button at the bottom.
    2a2. Check in check boxes all you want to keep, hit the Invert button to reverse the selections, and then click the Remove button at the bottom.
    2a3. Choose Filter -> unsupervised -> attribute -> Remove, set the comma-separated and dashed-range attribute numbers, then click Apply.
        Option 2a3 is useful when you have multiple ranges of attributes to Remove.
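For option 2a3, it can help to compute the attribute numbers rather than count them by hand. Here is a small Python sketch that collapses 1-based attribute numbers into the comma-separated, dashed-range form. The helper name to_weka_ranges is hypothetical, and the name list is only the first eight attributes from the header example above, not the full 199.

```python
def to_weka_ranges(indices):
    """Collapse 1-based attribute numbers into a string like '1-3,5,7-9'."""
    ordered = sorted(set(indices))
    parts = []
    start = prev = None
    for i in ordered:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i                      # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i              # begin a new run
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)

# Illustrative header: only the first eight attributes shown earlier.
names = ["datetime", "year", "yearSince1976", "month",
         "monthday", "yearday", "daySinceAug1", "duration"]
keep = {"year", "month", "monthday"}
remove = [pos for pos, name in enumerate(names, start=1) if name not in keep]
remove_spec = to_weka_ranges(remove)
print(remove_spec)
```

The resulting string is what you would type into the Remove filter's attribute index field. Always confirm the remaining attributes in the Preprocess tab afterwards.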

2b. Select Filter -> unsupervised -> attribute -> Reorder to make WindSpd the final attribute at the bottom without reordering the others.
    We are doing this step because we plan to correlate the remaining non-target attributes to WindSpd as the target attribute.
    Click Apply to reorder the attributes and check the result.
    When it is correct, Save as CSC458assn2.arff.gz AFTER setting Files of Type to (*.arff.gz). You will turn this file in to D2L.
    Whenever you Open or Save an ARFF file, set Files of Type to (*.arff.gz) AND make sure the file name ends in .arff.gz.
    I will go over the Weka GUI steps on 9/27. You can watch the posted video to review.
    In general, be careful with editing data. An erroneous edit can change analysis results substantially.
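As a cross-check for step 2b, this Python sketch builds a 1-based index list that moves one named attribute to the end while keeping the others in their original order. The helper name move_to_end_spec is hypothetical, and the six-attribute list is a shortened illustration rather than your full attribute list.

```python
def move_to_end_spec(names, target):
    """Return 1-based indices that place `target` last and keep the rest in order."""
    pos = names.index(target) + 1                      # 1-based position of target
    others = [i for i in range(1, len(names) + 1) if i != pos]
    return ",".join(str(i) for i in others + [pos])

# Shortened, illustrative attribute list.
names = ["year", "month", "monthday", "HMtempC", "WindSpd", "wndN"]
reorder_spec = move_to_end_spec(names, "WindSpd")
print(reorder_spec)
```

The Reorder filter also accepts dashed ranges, so a long explicit list can be shortened in the same comma-and-dash style used by Remove.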

2c / Q1. Go to the Classify tab, select functions -> SimpleLinearRegression, hit Start, and examine the output.
Copy from Weka (use control-C, not a menu) and paste into README.assn2.txt Q1 ONLY the following results:

Linear regression on ???

???

Predicting ??? if attribute value is missing.
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                 ??? %
Root relative squared error             ???  %
Total Number of Instances             ???    
Ignored Class Unknown Instances                ???

The ???s above are for result data supplied by Weka. Paste only those lines with the actual results.

README.assn2.txt Q2: Based on the attribute identified by the "Linear regression on ???" output from Weka, which attribute most closely correlates with WindSpd?
Why does this attribute correlate closely with WindSpd? (Hints: Does it have a positive or negative multiplier in the regression formula? Positive indicates positive correlation and negative indicates negative. You can consult the current Hawk Mountain Study, Section 2 and search for this attribute name.)
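To make the idea behind the "Linear regression on ???" line concrete, here is a minimal Python sketch of simple (single-attribute) least-squares regression that fits one line per attribute and keeps the one with the lowest squared error. It is an illustration of the technique, not Weka's actual code, and the column names attrA and attrB and all numbers are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns squared error."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

# Made-up toy columns, NOT values from the assignment data.
columns = {
    "attrA": [1.0, 2.0, 3.0, 4.0],
    "attrB": [4.0, 1.0, 3.0, 2.0],
}
target = [8.0, 6.1, 3.9, 2.0]

fits = {name: fit_line(xs, target) for name, xs in columns.items()}
best = min(fits, key=lambda name: fits[name][2])   # lowest squared error wins
slope, intercept, _ = fits[best]
print(f"Linear regression on {best}: {slope:.2f} * {best} + {intercept:.2f}")
```

In this toy data the chosen attribute has a negative slope, so the target falls as that attribute rises. That is the sense in which the sign of the multiplier indicates the direction of the correlation.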

2d / Q3. Go to the Classify tab, select functions -> LinearRegression, hit Start, and examine the output.
Copy from Weka (use control-C, not a menu) and paste into README.assn2.txt Q3 ONLY the following results:

WindSpd =
    ??? (Paste full linear expression here.)
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                 ??? %
Root relative squared error             ??? %
Total Number of Instances             ???
Ignored Class Unknown Instances                ???

README.assn2.txt Q4: Which attribute in the full linear expression has the strongest correlation, positive or negative, with WindSpd? Ignore the sign and just use the absolute value of the multiplier in deciding your answer.  How does this relate to your answers in Q2?

2e. In the Preprocess tab, APPLY Filter -> unsupervised -> attribute -> Normalize to place all of the attributes except target WindSpd into the same scale [0.0, 1.0] by computing, for each AttributeValue in a row,
    ((AttributeValue - min(AttributeValueForThatColumn)) / (max(AttributeValueForThatColumn) - min(AttributeValueForThatColumn)))
We are normalizing in order to get the multiplier weights for the non-target attributes on the same scale. These weights show the multipliers' actual correlation to the target. The unnormalized multipliers are in application-domain ranges that may vary widely by attribute. Inspect the attributes in the Preprocess tab to ensure that all are in the range [0.0, 1.0] except for the target attribute, which is in its original range.
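The normalization formula above can be sketched in a few lines of Python. The function name min_max_normalize and the toy columns are illustrative assumptions. Mapping a constant column to 0.0 is also an assumption of this sketch, made only to avoid dividing by zero.

```python
def min_max_normalize(column):
    """Rescale a list of numbers into [0.0, 1.0] using (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption of this sketch: a constant column maps to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up toy data; the target column is deliberately left untouched.
data = {
    "HMtempC": [10.0, 20.0, 30.0],
    "wndNW":   [0.0, 2.0, 8.0],
    "WindSpd": [5.0, 12.0, 9.0],
}
target = "WindSpd"
normalized = {
    name: (values if name == target else min_max_normalize(values))
    for name, values in data.items()
}
print(normalized)
```

Every non-target column now has minimum 0.0 and maximum 1.0, while WindSpd keeps its original range, which is what you should see when you inspect the attributes in the Preprocess tab.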

README.assn2.txt Q5:
Go to the Classify tab, select functions -> LinearRegression again, this time with normalized non-target attributes, hit Start, and examine the output. Copy from Weka and paste into README.assn2.txt Q5 ONLY the following results:

WindSpd =
    ??? (Paste full linear expression here.)
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                 ??? %
Root relative squared error             ??? %
Total Number of Instances             ???
Ignored Class Unknown Instances                ???

Q6: What are the top TWO non-target attributes of Q5 in terms of absolute multiplier magnitude? Has normalization changed the ID of the strongest-correlation attribute? Has normalization changed the ID of the second-strongest-correlation attribute? Give the names of the strongest and second-strongest correlated attributes from Q5 and then those from Q3. What accounts for any changes?

Q7: Has Correlation coefficient or any of the error measures changed significantly (more than 5%, or at all) in going from Q3 to Q5?

2f / Q8. Go to the Classify tab, select trees -> M5P, hit Start, and examine the output. The non-target attributes are still normalized.
Copy from Weka and paste into README.assn2.txt Q8 ONLY the following results:


M5 pruned model tree:
(using smoothed linear models)

??? <= ???.045 : LM1 (3567/97.151%)
??? >  ???.045 : LM2 (1850/65.199%)

LM num: 1
WindSpd = (Paste full linear expression here.)

LM num: 2
WindSpd = (Paste full linear expression here.)

Number of Rules : 2
Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???
Relative absolute error                 ??? %
Root relative squared error             ??? %
Total Number of Instances             ???
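
To read the model tree output above, it may help to see how a two-leaf model tree produces a prediction: the tree test picks a leaf, and that leaf's linear model computes the value. The Python sketch below uses made-up attribute names, a made-up threshold, and made-up weights, and it ignores details of M5P such as smoothing.

```python
def predict_model_tree(row, split_attr, threshold, lm1, lm2):
    """Pick LM1 when row[split_attr] <= threshold, else LM2, then evaluate it."""
    model = lm1 if row[split_attr] <= threshold else lm2
    return model["intercept"] + sum(w * row[a] for a, w in model["weights"].items())

# Hypothetical leaf models; none of these numbers come from Weka.
lm1 = {"intercept": 1.0, "weights": {"attrX": 2.0}}
lm2 = {"intercept": 3.0, "weights": {"attrX": 0.5, "attrY": 1.0}}

low = predict_model_tree({"attrX": 0.25, "attrY": 1.0}, "attrX", 0.5, lm1, lm2)
high = predict_model_tree({"attrX": 1.0, "attrY": 2.0}, "attrX", 0.5, lm1, lm2)
print(low, high)
```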

Q9: In terms of the M5P decision tree that uses (<, <=, >, or >=) operators, which attribute is the most important in selecting a linear expression to run? How does this agree or disagree with your earlier analyses above?

Q10: How does the multiplier weight of Q9's attribute in formula LM num: 1 compare with its weight in LM num: 2? What might account for the reduction in importance in one of these expressions? (Hint: What does the decision tree accomplish?)

2g: In the Preprocess tab load your CSC458assn2.arff.gz with unnormalized (original) values into Weka. Run Filter -> unsupervised -> instance -> RemovePercentage with the default configuration parameter of 50% and invertSelection = False, click Apply, and verify that only 2821 instances remain. Save this as train1.arff.gz. Use the Edit window to see that instances are in ascending order by (year, month, monthday) starting in August 2000.

Hit Undo once and verify 5642 instances are back, change the RemovePercentage config parameter invertSelection to True, click OK, and Apply this filter, verifying 2821 instances.
Save this as test1.arff.gz. Use the Edit window to see that instances are in ascending order by (year, month, monthday) starting in August 1976.

Load train1.arff.gz into Weka to train models, and use the Edit window to see that instances are in ascending order by (year, month, monthday) starting in August 2000.

Go to the Classify tab and change testing to "Supplied test set", clicking the Set button to select test1.arff.gz as the test dataset. We are training a model on train1.arff.gz (starting in year 2000) and testing on test1.arff.gz (starting in year 1976).
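
The ordered split in step 2g can be modeled with the following Python sketch. It mirrors the behavior described above (invertSelection = False keeps the later half, True keeps the earlier half). The function remove_percentage is an illustrative stand-in, its rounding rule is an assumption, and the list of years stands in for the time-ordered instances.

```python
def remove_percentage(instances, percentage=50.0, invert_selection=False):
    """Drop the first `percentage` of instances, or keep only those if inverted."""
    # Rounding rule is an assumption of this sketch.
    cut = int(len(instances) * percentage / 100.0 + 0.5)
    if invert_selection:
        return instances[:cut]    # keep the first part (earlier rows)
    return instances[cut:]        # keep what is left after removal (later rows)

# Stand-in for time-ordered instances: one value per year.
years = list(range(1976, 1986))
train1 = remove_percentage(years)                          # later half
test1 = remove_percentage(years, invert_selection=True)    # earlier half
print(train1, test1)
```

Because the data are in time order, the training half and the testing half come from different eras.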

Q11: In Classify run LinearRegression and record only these results.

Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???

Q12: How do the Correlation coefficient (CC),  Mean absolute error (MAE), and Root mean squared error (RMSE) of Q11 compare to those of LinearRegression in Q3 and Q5? Did they get better, worse, or stay the same? List the actual values like this, putting in the values for CC, MAE, and RMSE:
Q3  CC  MAE RMSE
Q5  CC  MAE RMSE
Q11 CC  MAE RMSE
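
For reference, the three measures compared above can be computed from actual and predicted values as in this Python sketch. It uses the standard textbook definitions (Pearson correlation, mean absolute error, root mean squared error). The numbers are made up, and the function names are illustrative.

```python
import math

def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up values, NOT results from the assignment.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
cc = correlation_coefficient(actual, predicted)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"CC {cc:.4f}  MAE {mae:.4f}  RMSE {rmse:.4f}")
```

A higher CC is better, while lower MAE and RMSE are better. RMSE penalizes large individual errors more heavily than MAE does.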

2h: In the Preprocess tab load your CSC458assn2.arff.gz with unnormalized (original) values into Weka. Run Filter -> unsupervised -> instance -> Randomize with a default seed of 42 (make sure to Apply Randomize ONLY ONCE). Then run Filter -> unsupervised -> instance -> RemovePercentage with the default configuration parameter of 50% and invertSelection = False, click Apply, and verify that only 2821 instances remain. Save this as train2.arff.gz. Use the Edit window to see that instances are in random order.

Hit Undo once and verify 5642 instances are back, change the RemovePercentage config parameter invertSelection to True, click OK, and Apply this filter, verifying 2821 instances.
Save this as test2.arff.gz. Use the Edit window to see that instances are in random order.

Load train2.arff.gz into Weka to train models.

Go to the Classify tab and change testing to "Supplied test set", clicking the Set button to select test2.arff.gz as the test dataset. We are training a model on train2.arff.gz and testing on test2.arff.gz. Their order is randomized but they use different, shuffled instances from CSC458assn2.arff.gz.
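
The shuffle-then-split procedure of step 2h can be sketched in Python as follows. This is only a model of the idea. Python's random generator will NOT reproduce Weka's shuffled order even with the same seed of 42, and the helper names and the rounding rule are assumptions of this sketch.

```python
import random

def randomize(instances, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def remove_percentage(instances, percentage=50.0, invert_selection=False):
    """Drop the first `percentage` of instances, or keep only those if inverted."""
    cut = int(len(instances) * percentage / 100.0 + 0.5)  # rounding is an assumption
    return instances[:cut] if invert_selection else instances[cut:]

# Stand-in for the instances: row numbers 1..10.
rows = list(range(1, 11))
shuffled = randomize(rows)              # shuffle ONCE, then split the same order twice
train2 = remove_percentage(shuffled)
test2 = remove_percentage(shuffled, invert_selection=True)
print(train2, test2)
```

Shuffling only once matters: both halves must be cut from the same shuffled order, or the training and test sets could share instances.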

Q13: In Classify run LinearRegression and record only these results.

Correlation coefficient                  ???
Mean absolute error                      ???
Root mean squared error                  ???

Q14: How do the Correlation coefficient (CC), Mean absolute error (MAE), and Root mean squared error (RMSE) of Q13 compare to those of LinearRegression in Q11? Did they get better, worse, or stay the same? List the actual values like this, putting in the values for CC, MAE, and RMSE:
Q11 CC  MAE RMSE
Q13 CC  MAE RMSE

Q15: What accounts for the changes in Q14's measures in going from Q11 to Q13?