CSC 458 - Data Mining & Predictive Analytics I, Spring 2024
Assignment 1: Classification
Due by 11:59 PM on Thursday February 15 via D2L. We will have
some work time in class.
You will turn in 6 files by the deadline, with a 10% per day late penalty and 0 points after I go over my solution: CSC458S24ClassifyAssn1Turnin.arff, handouttrain.arff, handouttest.arff, randomtrain.arff, randomtest.arff, and README.txt with your answers. The steps below describe how to create them. I prefer that you turn in a .zip folder (no .7z) containing only those files. You can turn in individual files if you don't have a zip utility.
As with all assignments except Python Assignment 4, this is a mixture of two things:
1. Analysis using stable Weka version 3.8.x. Use this free stable version, not a vendor version.
2. Answering questions in README.c458s24assn1.txt.
If you are running on a campus PC with the S:\ network drive mounted, clicking s:\ComputerScience\WEKA\WekaWith2GBcampus.bat starts Weka 3.8.6. Save your work on a thumb drive or other persistent drive; campus PCs erase what you save on their drives when you log off. Many students download Weka 3.8.x and work on their own PCs or laptops.
Assignment Background
Here is CSC458S24ClassifyAssn1Handout.zip containing three files:
CSC458S24ClassifyAssn1Handout.arff is the Weka Attribute-Relation File Format (ARFF) dataset with the starting data.
README.txt contains questions that you must answer between some steps.
extractAudioFreqARFF17Oct2023.py is my Python file used to extract data from .wav files. It is included in the handout so we can go over example Python scripts before Assignment 4.
Our application domain is audio signal analysis using data that I generated. I have made changes to the analyses so this will not be a repeat of a CSC523 project. The dataset remains the same, but the data preparation and analyses differ.
Figure 1 shows a generated 1000 Hz (cycles per second) audio waveform in the time domain. This is the third waveform whose data appears in your handout ARFF file.
Figure 1: 1000 Hz sine wave with gain of 0.9 on scale [0.0, 1.0] with white noise of 0.
Here is what this .wav file sounds like (TURN DOWN YOUR AUDIO!).
Figure 2 shows the frequency domain plot of the same .wav file. These plots act like histograms of the frequency components of their signals. The decibel scale on the right of these plots is logarithmic.
Figure 2: Frequency domain plot of Figure 1's waveform.
Note that the sine wave is primarily its fundamental frequency at 1000 Hz.
Figures 3 and 4 show the time and frequency domain plots of a 1000 Hz pulse wave with a gain of 0.9 and white noise of 0.0, the first instance in your starting ARFF file.
Figure 3: 1000 Hz pulse wave, gain=0.9, white noise=0.0.
Figure 4: Frequency domain plot of Figure 3's waveform.
Note that the pulse contains peaks at both even and odd multiples of the 1000 Hz fundamental. Unlike the others, it has some low-frequency noise. Here is its sound (CAREFUL!).
Figures 5 and 6 show the time and frequency domain plots of a 1000 Hz sawtooth wave with a gain of 0.9 and white noise of 0.0, the second instance in your starting ARFF file.
Figure 5: 1000 Hz sawtooth wave, gain=0.9, white noise=0.0.
Figure 6: Frequency domain plot of Figure 5's waveform.
The sawtooth also contains peaks at both even and odd multiples of the 1000 Hz fundamental, but its peaks decay faster than the pulse's. Here is its sound (CAREFUL!).
Figures 7 and 8 show the time and frequency domain plots of a 1000 Hz square wave with a gain of 0.9 and white noise of 0.0, the fourth instance in your starting ARFF file.
Figure 7: 1000 Hz square wave, gain=0.9, white noise=0.0.
Figure 8: Frequency domain plot of Figure 7's waveform.
Square waves contain peaks only at odd multiples of the 1000 Hz fundamental. Here is its sound (CAREFUL!).
Figures 9 and 10 show the time and frequency domain plots of a 1000 Hz triangle wave with a gain of 0.9 and white noise of 0.0, the fifth instance in your starting ARFF file.
Figure 9: 1000 Hz triangle wave, gain=0.9, white noise=0.0.
Figure 10: Frequency domain plot of Figure 9's waveform.
The triangle also contains peaks only at odd multiples of the 1000 Hz fundamental, but its peaks decay faster than the square's. Here is its sound (CAREFUL!).
Figures 11 and 12 show the time and frequency domain plots of a 1380 Hz sine wave with a gain of 0.718 and white noise of 0.228. This .wav has data in one of the rows of your ARFF file. We will regress white noise levels in a later assignment. In this assignment we are classifying waveform type.
Figure 11: 1380 Hz sine wave, gain=0.718, white noise=0.228.
Figure 12: Frequency domain plot of Figure 11's waveform.
Signal gain levels range from 0.5 to 0.75 in this dataset, except for the five 0-noise reference examples, which have signal gain = 0.9 on the scale [0.0, 1.0]. White noise gains vary from 0.1 to 0.25, except for those five 0-noise instances.
Here are the data attributes (CSV columns) in your handout ARFF file.
@relation csc523fa2023AudioHarmonicData_32
@attribute ampl1 numeric      Amplitude of fundamental freq normalized to 1.0
@attribute freq1 numeric      Frequency of fundamental normalized to 1.0
@attribute ampl2 numeric      Second strongest amplitude as a fraction of ampl1
@attribute freq2 numeric      Frequency of ampl2 as a multiple of freq1
@attribute ampl3 numeric
@attribute freq3 numeric
@attribute ampl4 numeric
@attribute freq4 numeric
... (ampl5 through freq31 elided)
@attribute ampl32 numeric
@attribute freq32 numeric
@attribute toosc {PulseOsc,SawOsc,SinOsc,SqrOsc,TriOsc}      Tagged waveform type
@attribute tfreq numeric      Tagged fundamental frequency
@attribute toscgn numeric     Tagged signal gain
@attribute tnoign numeric     Tagged white noise gain
@attribute tid numeric        Tagged instance ID
@data
CSV instance-per-row data instances (rows) follow
Figure 13 shows the ampl1 through freq4 attributes and the tagged toosc target attribute values of the first 10 instances in your handout ARFF file. Note the even and odd harmonics in the pulse and sawtooth waves, with faster amplitude decay in the latter, and the odd harmonics in the square and triangle waves, with faster amplitude decay in the latter. The sine wave is only a fundamental plus some very minor noise.
Figure 13: The first ten instances in CSC458S24ClassifyAssn1Handout.arff.
extractAudioFreqARFF17Oct2023.py extracts .wav file audio data into CSC458S24ClassifyAssn1Handout.arff.
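The sketch below is not the handout script; it only illustrates, under simplifying assumptions, how harmonic (amplitude, frequency) pairs like ampl1..freq32 could be pulled out of a .wav file with an FFT. It assumes numpy and scipy are available, a hypothetical mono file named tone.wav, and naive single-bin peaks rather than the real script's extraction logic.

import numpy as np
from scipy.io import wavfile

# Hypothetical mono .wav file; the real handout data came from my generator.
rate, samples = wavfile.read("tone.wav")           # sample rate, raw samples
spectrum = np.abs(np.fft.rfft(samples))            # magnitude spectrum
spectrum[0] = 0.0                                  # ignore the DC bin
freqs = np.fft.rfftfreq(len(samples), d=1.0/rate)  # frequency of each bin

# Take the 32 strongest bins as (amplitude, frequency) pairs.
top = np.argsort(spectrum)[::-1][:32]
ampls, peak_freqs = spectrum[top], freqs[top]

# Normalize the way the ARFF attributes are described: ampl1 and freq1 become
# 1.0, later amplitudes are fractions of ampl1, later frequencies are
# multiples of freq1.
ampl_norm = ampls / ampls[0]
freq_norm = peak_freqs / peak_freqs[0]
print(list(zip(ampl_norm[:4].round(3), freq_norm[:4].round(3))))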
YOUR WORK ASSIGNMENT.
All answers to README questions must go into the handout README.txt file.
STEP 1: Load CSC458S24ClassifyAssn1Handout.arff into Weka. If you use Weka's editor, be careful not to reorder instances until instructed to do so.
STEP 2: After noting how many attributes there are in the Weka Preprocess tab, run the following filter that I will demo in class. Figure 14 shows a screen shot of this step.
Figure 14: Filter -> unsupervised -> attribute -> RemoveUseless
Supervised filters attempt to correlate changes to the target attribute, toosc in this case. Unsupervised filters just make the entered changes. Hit Apply in the Weka Preprocess tab to run this filter.
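For intuition, here is a rough, unsupervised Python analog of what RemoveUseless does for numeric attributes: drop any column whose value never changes, without looking at the class. It assumes pandas and a hypothetical CSV export of the handout data named handout.csv.

import pandas as pd

df = pd.read_csv("handout.csv")                    # hypothetical CSV export
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print("Would remove:", constant_cols)              # compare with your Q1 answer
df = df.drop(columns=constant_cols)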
README Q1: What attributes did RemoveUseless remove?
Why do you think it removed these?
STEP 3: Go into Weka's Edit window and sort instances in ascending order on the ampl2 column header. Observe that the minimum value comes to the top. Click OK to save this change. Manually delete the tagged attributes tfreq, toscgn, tnoign, and tid. This step leaves 63 attributes. We will classify the remaining tagged target attribute toosc.
Save this dataset as CSC458S24ClassifyAssn1Turnin.arff. You will turn that in along with your README.txt file with answers and 4 additional ARFF files that you will create: handouttrain.arff, handouttest.arff, randomtrain.arff, and randomtest.arff.
Initial testing uses 10-fold cross-validation, which means training a model on 9/10ths of randomly selected instances, testing on the remaining tenth, then repeating the process 9 more times with a different held-out tenth each time, and aggregating the results. In later steps we will use different training and testing portions of CSC458S24ClassifyAssn1Turnin.arff.
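Weka handles the folds internally; the minimal Python sketch below only illustrates the 10-fold idea. Here instances is assumed to be a list of data rows and train_and_score(train, test) a hypothetical function that trains a model and returns its score on the test rows.

import random

def ten_fold_cv(instances, train_and_score, seed=1):
    rows = list(instances)
    random.Random(seed).shuffle(rows)           # random fold assignment
    folds = [rows[i::10] for i in range(10)]    # 10 roughly equal folds
    scores = []
    for i in range(10):
        test = folds[i]                         # the held-out tenth
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)            # aggregate the 10 results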
STEP 4: In Weka's Classify tab run Classifier -> rules -> ZeroR.
README Q2: Copy and paste the following Weka output. You must use control-C to copy.

Correctly Classified Instances      N             N.n %
Incorrectly Classified Instances    N             N.n %
Kappa statistic                     N.n
Mean absolute error                 N.n
Root mean squared error             N.n
Relative absolute error             N %
Root relative squared error         N %
Total Number of Instances           N

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
   N   N   N   N   N |   a = PulseOsc
   N   N   N   N   N |   b = SawOsc
   N   N   N   N   N |   c = SinOsc
   N   N   N   N   N |   d = SqrOsc
   N   N   N   N   N |   e = TriOsc
README Q3: What percentage of the 10,005 instances are correctly classified? Given the number of distinct waveform types, how much better is this percentage than a random guess?
Note the kappa value that we will go over in class. It is our primary measure of accuracy in this assignment.
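If you want to check a kappa value by hand, the sketch below computes Cohen's kappa from a confusion matrix: kappa = (observed accuracy - chance accuracy) / (1 - chance accuracy), where chance accuracy comes from the row and column totals. The 2x2 example matrix is made up for illustration.

def kappa(matrix):
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # chance agreement from the row (actual) and column (predicted) totals
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (total * total)
    return (observed - expected) / (1 - expected)

print(kappa([[5, 0], [0, 5]]))   # a perfectly diagonal matrix gives 1.0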
STEP 5: In Weka's Classify tab run Classifier -> rules -> OneR.
README Q4: Copy and paste the following Weka output.

=== Classifier model (full training set) ===

ampl2:
SHOW THE FULL OneR rule here.
(N/10005 instances correct)

Correctly Classified Instances      N             N.n %
Incorrectly Classified Instances    N             N.n %
Kappa statistic                     N.n
Mean absolute error                 N.n
Root mean squared error             N.n
Relative absolute error             N %
Root relative squared error         N %
Total Number of Instances           N

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
   N   N   N   N   N |   a = PulseOsc
   N   N   N   N   N |   b = SawOsc
   N   N   N   N   N |   c = SinOsc
   N   N   N   N   N |   d = SqrOsc
   N   N   N   N   N |   e = TriOsc
README Q5: How well did the OneR model do in terms of % correct and kappa?
README Q6: What non-target attribute appearing in "Classifier model" above did Weka use to classify toosc?
README Q7: In terms of the paragraph preceding Figure 13 in the handout, why do you think OneR selected this attribute?
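As background for Q6 and Q7: OneR builds a one-attribute rule for every attribute and keeps the single attribute whose rule makes the fewest training errors. The sketch below shows only that basic idea with naive equal-width binning; Weka's OneR uses its own bucketing scheme (minBucketSize), so this is not its exact algorithm. It assumes rows is a list of dicts keyed by attribute name, including the "toosc" class.

from collections import Counter

def one_r(rows, attributes, target="toosc", bins=10):
    best = None
    for attr in attributes:
        values = [r[attr] for r in rows]
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0
        buckets = {}                         # majority class per value bin
        for r in rows:
            b = min(int((r[attr] - lo) / width), bins - 1)
            buckets.setdefault(b, Counter())[r[target]] += 1
        errors = sum(sum(c.values()) - max(c.values()) for c in buckets.values())
        if best is None or errors < best[1]:
            best = (attr, errors)
    return best                              # (chosen attribute, training errors)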
STEP 6: In Weka's Classify tab run Classifier -> trees -> J48.
README Q8: Copy and paste the following Weka output.

J48 pruned tree
------------------
SHOW THE DECISION TREE in README

Number of Leaves  : N
Size of the tree  : N

Correctly Classified Instances      N             N.n %
Incorrectly Classified Instances    N             N.n %
Kappa statistic                     N.n
Mean absolute error                 N.n
Root mean squared error             N.n
Relative absolute error             N %
Root relative squared error         N %
Total Number of Instances           N

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
   N   N   N   N   N |   a = PulseOsc
   N   N   N   N   N |   b = SawOsc
   N   N   N   N   N |   c = SinOsc
   N   N   N   N   N |   d = SqrOsc
   N   N   N   N   N |   e = TriOsc
README Q9: How did J48 do in terms of kappa, and how does its kappa compare to OneR's in Q4 and Q5?
README Q10: What non-target attributes appear in J48's "J48 pruned tree" decision tree? How does this tree compare with OneR's rule in terms of intelligibility, i.e., which one is easier to understand?
Decision tree builders use information
entropy to decide the binary split points.
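The sketch below shows the entropy idea behind those split decisions: the candidate split with the largest information gain (the biggest drop in class entropy) wins. J48 (C4.5) actually uses gain ratio, a normalized variant, so treat this as the underlying concept rather than J48's exact criterion. The class labels are illustrative.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, left, right):
    # gain = entropy before the split minus the size-weighted entropy after
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after

labels = ["SqrOsc"] * 5 + ["TriOsc"] * 5
print(information_gain(labels, labels[:5], labels[5:]))   # 1.0 bit: a perfect split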
The concept of Minimal Description Length (MDL) refers to finding the most intelligible model (easiest to understand) that is no more than 10% less accurate (in terms of kappa for classification) than the most accurate model, which is presumably less intelligible. The 10% threshold is my rule of thumb.
README Q11: Which of these two models above, OneR's rule versus J48's tree, exhibits the MDL so far? Justify your answer.
STEP 7: Take a look at J48's confusion matrix in Q8. Each row shows how instances should have been classified, and each column shows how they were classified. A matrix where every instance falls on the diagonal and every off-diagonal entry is zero is a perfect result.
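In code terms, the misclassified count is just the sum of the off-diagonal entries (rows are the actual class, columns are the predicted class). The matrix below uses made-up placeholder numbers, not your Weka output.

def misclassified(matrix):
    return sum(matrix[i][j]
               for i in range(len(matrix))
               for j in range(len(matrix))
               if i != j)

example = [[10, 2, 0], [0, 12, 1], [0, 0, 9]]
print(misclassified(example))   # 3 misclassified instances in this example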
README Q12: How many instances in J48's confusion matrix were misclassified? Describe each non-zero entry that is off the diagonal in terms of: the number incorrect at that spot, what it should have been classified as, and what it was classified as.
STEP 8: Run the filter Filter -> unsupervised -> instance -> RemovePercentage with the default percentage of 50% and invertSelection set to false. This removes the first 50% of the instances. Apply this filter and note the number of remaining instances. Save this 50% dataset as handouttest.arff.
STEP 9: Execute Undo once and verify that all 10,005 instances have returned. Run the filter Filter -> unsupervised -> instance -> RemovePercentage with the default percentage of 50% and invertSelection set to true this time. This removes the last 50% of the instances. Apply this filter and note the number of remaining instances. Save this 50% dataset as handouttrain.arff.
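Conceptually, STEPS 8 and 9 split the file in half by position: the last 50% of the instances become handouttest.arff and the first 50% become handouttrain.arff. The plain-Python sketch below only illustrates that split; do the real work in Weka and save the files from the Preprocess tab. It assumes the ARFF @data section holds one CSV instance per line.

def split_arff(path):
    with open(path) as f:
        lines = f.readlines()
    start = next(i for i, ln in enumerate(lines)
                 if ln.strip().lower() == "@data") + 1
    header, rows = lines[:start], lines[start:]
    half = len(rows) // 2
    return header, rows[:half], rows[half:]    # header, first 50%, last 50%

header, first_half, last_half = split_arff("CSC458S24ClassifyAssn1Turnin.arff")
print(len(first_half), "training rows,", len(last_half), "test rows")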
STEP 10: With handouttrain.arff still in Weka, go to the Classify tab and set Supplied test set to handouttest.arff. We are using handouttrain.arff for training models and handouttest.arff for testing them with a different dataset to avoid testing on the training data. Run OneR and J48 on this training / testing configuration.
README Q13: How do OneR and J48 do in terms of % correct and kappa compared to 10-fold cross-validation in Q4 and Q8?
STEP 11: Load CSC458S24ClassifyAssn1Turnin.arff back into Weka. Run Filter -> unsupervised -> instance -> Randomize with a default seed of 42. Apply this filter ONLY ONCE to shuffle the instance order.
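The Randomize filter simply shuffles the instance order using a seed, so the same seed reproduces the same order. A quick Python analog of that idea (Weka's shuffling algorithm differs, so the resulting order will not match Weka's):

import random

rows = list(range(10005))        # stand-ins for the 10,005 instances
random.Random(42).shuffle(rows)  # seed 42, applied exactly once
print(rows[:5])                  # first few instances after shuffling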
STEP 12: Run the filter Filter -> unsupervised -> instance -> RemovePercentage with the default percentage of 50% and invertSelection set to false. As in STEP 8, this removes the first 50% of the instances. Apply this filter and note the number of remaining instances. Save this 50% dataset as randomtest.arff.
STEP 13: Execute Undo once and verify that all 10,005 instances have returned. Run the filter Filter -> unsupervised -> instance -> RemovePercentage with the default percentage of 50% and invertSelection set to true this time. This removes the last 50% of the instances. Apply this filter and note the number of remaining instances. Save this 50% dataset as randomtrain.arff.
STEP 14: With randomtrain.arff still in Weka, go to the Classify tab and set Supplied test set to randomtest.arff. We are using randomtrain.arff for training models and randomtest.arff for testing them with a different dataset to avoid testing on the training data. Run OneR and J48 on this training / testing configuration.
README Q14: How do OneR and J48 do in terms of % correct and kappa compared to the test results in Q13 that used handouttrain.arff and handouttest.arff? (Note: Neither Q13 nor Q14 uses cross-validation; both use a supplied test set.)
README Q15: What accounts for the differences in OneR's and J48's kappa values in Q14 compared to Q13? To figure this out, load handouttrain.arff, handouttest.arff, randomtrain.arff, and randomtest.arff, one at a time, into the Preprocess tab and look at the toosc distributions by clicking that attribute and examining its distribution in the Selected attribute panel in the upper right of the Preprocess tab.
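If you prefer to check those distributions outside Weka, the sketch below tallies the toosc class counts in each split file. It assumes toosc is the last column of each data row (it should be after STEP 3's deletions) and that the @data section holds one CSV instance per line; adjust the column index if your file differs.

from collections import Counter

def toosc_counts(path):
    counts = Counter()
    in_data = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if in_data and line and not line.startswith("%"):
                counts[line.split(",")[-1]] += 1
            elif line.lower() == "@data":
                in_data = True
    return counts

for name in ["handouttrain.arff", "handouttest.arff",
             "randomtrain.arff", "randomtest.arff"]:
    print(name, toosc_counts(name))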
Each of Q1 through Q15 is worth 6%, giving 90%. Each of the files CSC458S24ClassifyAssn1Turnin.arff, handouttrain.arff, handouttest.arff, randomtrain.arff, and randomtest.arff is worth 2% if correct. Make sure to turn in all 5 ARFF files plus README.txt with answers. You can zip a folder with these 6 files or turn them in to D2L individually.