> Why would there be so much difference based on the (instance) order of that expanded (max) set? I was really surprised at the amount of difference between the two sets of results.

There are a couple of issues here. In the min datasets, the training and testing instances are mutually exclusive at that point, and even though they have been shuffled, there are too few training instances for most of the regressor-building algorithms. K-nearest-neighbors is really the only one geared toward a small training set, but if that small training set is not a good cross-section of the test data, it will still give poor CCs and other measures.

With the max datasets, there are two issues. The first is that minor differences in instance order can trigger minor floating-point differences in model results (not in the STUDENT1-4 CCs, which are deterministic). In my operating systems simulations in the past, I used a statistical diff script to allow a margin of difference, but the problem was always deciding what the margin should be. I decided to make these assignments deterministic by giving exact instructions on shuffling, seed values, and partitioning of the training & testing datasets.

The second issue is learning algorithm stability. Take a look at slide 4 of https://faculty.kutztown.edu/parson/fall2022/WekaChapter12.pptx, which we went over: Bagging's "Learning scheme is unstable ... Unstable learner: small change in training data can make big change in model..." With only (46 instances - instances from early years) split into halves for training and testing, the training data are too small; I had qualms about using them. These monthly data sets come from analyses of much bigger datasets that show different climate change patterns for specific months and even weeks. Your assn2 data consists of averages and aggregated counts over many observations with many instances.
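To make the determinism point concrete, here is a minimal sketch in Python of a seeded shuffle and half/half split; the seed value, function name, and split point are illustrative assumptions only, and the assignment's exact instructions are what count.

import random

SEED = 12345   # illustrative seed; the assignment specifies the exact value to use

def split_half(instances, seed=SEED):
    # Same seed + same incoming instance order => same shuffle => same split,
    # which is what makes everyone's reference results reproducible.
    data = list(instances)            # copy so the caller's list is untouched
    random.seed(seed)
    random.shuffle(data)
    mid = len(data) // 2
    return data[:mid], data[mid:]     # (training half, testing half)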
Check this out! It shows the top two regressors for each student dataset in terms of CC. Both are basically memorizing the training data. They are over-fit.

[:-) ~/.../solutions/CSC523f23Regressassn2] head -2 *.sorted.txt.ref | cut -d' ' -f1-3,5,12-17
==> agend932_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> aroes474_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> ccohe693_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> dpate250_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> eswan071_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> jmcna260_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> jrecc716_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> larce410_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> mling459_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> ncheh472_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> pagan_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> parson_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> pbart313_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> plin983_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> pperr657_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> rwalt267_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> smann624_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> sshah594_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> thall326_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> vmari085_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

==> wbliz011_CSC523f23Regressassn2.txt.sorted.txt.ref <==
DATA 6 maxraw BaggingRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000
DATA 8 maxraw KNeighborsRegressor CC 1.000000 RMSQE 0.000000 MABSE 0.000000

The next assignment won't suffer from this limitation, but this is still a worthwhile exercise, both for the STUDENT1-4 CCs (which I have never done in an assignment before), and for LinearRegression models that are intelligible:

DATA 4 maxraw REGRESSOR LinearRegression TRAIN # 700 TEST # 700 CC 0.587130 RMSQE 27.447144 MABSE 24.322490 AMN 43.00 PMN 53.52 AMX 173.00 PMX 115.28 AMN 89.36 PMN 89.36 AMD 87.00 PMD 92.63 ASD 33.91 PSD 19.91
RT_All = 10.966855 * WindSpd_mean + -7.170794 * HMtempC_mean + 0.371054 * wnd_WNW_NW + 116.177000

The biggest problem with that one is that the non-target attributes are not normalized into the range [0.0, 1.0], i.e., (value - min) / (max - min), so those multipliers are not on the same scale. I should have normalized them; I will do a round of that in class. You can see, though, that as the mean wind speed and the WNW_NW measure go up, and TempC goes down, the RT count goes up.

Smooth shows long-term trends (slopes), i.e., smoothed changes over time. These values were derived via exponential smoothing (which we will go over in class) applied to normalized raw values.

DATA 9 maxsmooth REGRESSOR LinearRegression TRAIN # 700 TEST # 700 CC 0.932526 RMSQE 0.013283 MABSE 0.010296 AMN 0.35 PMN 0.36 AMX 0.47 PMX 0.46 AMN 0.41 PMN 0.41 AMD 0.43 PMD 0.42 ASD 0.04 PSD 0.03
RT_All_smooth = 0.345982 * WindSpd_mean_smooth + -0.642065 * HMtempC_mean_smooth + 0.214916 * wnd_WNW_NW_smooth + 0.581052
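For anyone who wants to see what that normalize-then-smooth step looks like, here is a minimal Python sketch; the alpha value and function names are illustrative assumptions, not the exact parameters used to build the maxsmooth datasets.

def minmax_normalize(values):
    # Scale a column into [0.0, 1.0] via (value - min) / (max - min).
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def exponential_smooth(values, alpha=0.1):
    # Simple exponential smoothing: each output mixes the newest value with
    # the previous smoothed value, so long-term trends emerge from the noise.
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed

# e.g., a column like WindSpd_mean_smooth would come from something like:
#     exponential_smooth(minmax_normalize(windspd_mean_column))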