Due by 11:59 PM on Thursday February 29 via D2L. We will have some work time in class.

You will turn in 2 files by the deadline, with a 10% per day late penalty and 0 points after I go over my solution:

Their creation is given in the steps below. I prefer that you turn in a .zip folder (no .7z) containing only those files. You can turn in individual files if you don't have a zip utility.

As with all assignments except Python Assignment 4, this is a mixture of two things.

1. Analysis using stable Weka version 3.8.x. Use this free stable version, not a vendor version.

2. Answering questions in README.txt

If you are running on a campus PC with the S:\ network drive mounted, clicking:

s:\ComputerScience\WEKA\WekaWith2GBcampus.bat

starts Weka 3.8.6. Save your work on a thumb drive or other persistent drive.

Campus PCs erase what you save on their drives when you log off.

Many students download Weka 3.8.x and work on their own PCs or laptops.

We are using only 10-fold cross-validation testing in this assignment for simplicity.

Here is the dataset with the starting data for Assignment 1. We will use this data briefly before the regression analysis in Assignment 2. Regression attempts to predict numeric target attribute values in each instance based on non-target attributes that may be numeric or nominal.

extractAudioFreqARFF17Oct2023.py extracts this data from .wav files.

Please refer to the Assignment 1 handout for background on this audio analysis project. In Assignment 2 we are regressing values for the tagged gain attribute, with values in the range [0.5, 0.9] on a scale of [0.0, 1.0]. This attribute is present in both of the handout ARFF files, but we did not use it in Assignment 1. It is the only tagged attribute being used in Assignment 2.

A previous semester's handout, recently updated using SciPy's wav file reader and fft frequency-domain histogram extraction, serves as a reference.
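As a rough sketch of what such extraction involves (the function name, the mono mixdown, and the peak count below are assumptions for illustration, not the handout script's actual logic):

```python
import numpy as np
from scipy.io import wavfile
from scipy.fft import rfft, rfftfreq

def extract_peaks(path, n_peaks=4):
    """Read a .wav file and return (frequency, amplitude) pairs for the
    strongest bins of its FFT magnitude histogram, strongest first."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                      # mix stereo down to mono
        samples = samples.mean(axis=1)
    mags = np.abs(rfft(samples.astype(float)))
    freqs = rfftfreq(len(samples), d=1.0 / rate)
    order = np.argsort(mags)[::-1][:n_peaks]  # indices of strongest bins
    return [(freqs[i], mags[i]) for i in order]
```

For a single generated tone, the strongest bin is the fundamental, so the first pair returned approximates (freq1, ampl1).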

The main difference between Assignment 1's dataset and Assignment 2's is that, whereas the former normalized ampl2 through ampl32 as a fraction of the fundamental ampl1 scaled to 1.0, and normalized freq2 through freq32 as a multiple of the fundamental frequency scaled to 1.0, the latter contains neither of these scalings. The amplitude and frequency attributes are the raw values extracted by the wav file read function and the fft frequency-histogram function.
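Assignment 1's scaling can be sketched per instance as follows (the row layout and function name are illustrative assumptions; the handout's attribute names ampl1..ampl32 and freq1..freq32 are assumed to arrive as two lists):

```python
def normalize_row(freqs, ampls):
    """Assignment-1-style scaling: amplitudes become fractions of the
    fundamental ampl1 (scaled to 1.0), frequencies become multiples of
    the fundamental freq1 (scaled to 1.0)."""
    f1, a1 = freqs[0], ampls[0]
    return [f / f1 for f in freqs], [a / a1 for a in ampls]

# e.g. a 440 Hz fundamental with one harmonic at 880 Hz, half amplitude:
freqs, ampls = normalize_row([440.0, 880.0], [0.6, 0.3])
# freqs -> [1.0, 2.0], ampls -> [1.0, 0.5]
```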

Figure 1 illustrates that freq1 matches the fundamental frequency of ampl1 as generated by the line "freq = random.randint(100,2000)" in the signal generator, with the off-by-1 difference likely due to rounding.

We are predicting the tagged gain attribute, with values in the range [0.5, 0.9] on a scale of [0.0, 1.0], from non-tagged, non-target attributes. There should be 65 attributes at this point, including gain. Keep a record of your edits.

Use the Weka Preprocess tab to look for your answer. Clicking the ZeroR configuration entry line and then More is also helpful.

See the "Evaluating numeric prediction" links under Assignment 2 on the course page for evaluating testing results.
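The numeric-prediction measures Weka reports can be reproduced by hand. Here is a sketch using a ZeroR-style baseline, which for regression simply predicts the training mean of the target (the correlation coefficient is omitted because it is undefined for a constant predictor; the gain values are made up):

```python
import math

def eval_numeric(actual, predicted):
    """Weka-style numeric-prediction measures: MAE, RMSE, and the
    relative errors, which compare against predicting the mean."""
    n = len(actual)
    mean_a = sum(actual) / n
    mae  = sum(abs(p - a) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    rae  = mae  / (sum(abs(a - mean_a) for a in actual) / n) * 100
    rrse = rmse / math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n) * 100
    return mae, rmse, rae, rrse

gains = [0.5, 0.6, 0.7, 0.8, 0.9]
zeror = [sum(gains) / len(gains)] * len(gains)  # ZeroR predicts the mean, 0.7
print(eval_numeric(gains, zeror))  # RAE and RRSE are 100% by construction
```

Any regressor worth using should drive RAE and RRSE below the ZeroR baseline's 100%.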

"CfsSubsetEval -P 1 -E 1" with Search Method set to "BestFirst -D 1 -N5" and

hit

Selected attributes: List and number of attribute indices.

... (paste all lines below "Selected attributes")

Given the peak measurement on the left side of the frequency histogram, and the decay rates of subsequent ampl measures from Assignment 1 Figure 13, why do you think the ampl attributes were selected? Address this in your answer.

"CorrelationAttributeEval", click "yes" for the pop-up Search Method of

Ranker -- it is ranking attribute correlations to

attributes pasted in your answer relate to the ranking of STEP 3?

Attribute Evaluator (supervised, Class (numeric): 65 gain):
	Correlation Ranking Filter
Ranked attributes:
N.n  n attributeName
N.n  n attributeName
N.n  n attributeName
N.n  n attributeName
N.n  n attributeName

NOTE: The first number on each line is the correlation of that attribute to the target attribute, followed by its index and name.
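CorrelationAttributeEval's ranking can be reproduced with plain Pearson correlation; a minimal sketch, ranking by absolute correlation as Ranker effectively does (attribute names and values below are made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, target):
    """Rank attribute columns by |correlation| with the target."""
    scored = [(pearson(col, target), name) for name, col in columns.items()]
    return sorted(scored, key=lambda s: abs(s[0]), reverse=True)

cols = {"ampl1": [1.0, 2.0, 3.0, 4.0], "freq7": [4.0, 1.0, 3.0, 2.0]}
gain = [0.5, 0.6, 0.7, 0.8]
print(rank_attributes(cols, gain))  # ampl1 ranks first, correlation 1.0
```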

Run SimpleLinearRegression and paste these output lines. SimpleLinearRegression is the regression counterpart to classification's OneR, selecting the single most strongly correlated non-target attribute to correlate with the target. Is it the most highly correlated attribute in Q4 and Q6, and has the correlation coefficient improved over ZeroR?

Run LinearRegression, which attempts to use all correlated attributes. Paste this part of its result. Has the correlation coefficient improved over SimpleLinearRegression? What do you notice about the multipliers in its linear formula? (Do not paste this formula, just inspect it. It has the form:)

gain =
    n * attributeName +
    n * attributeName +

Run M5P, which also attempts to use all correlated attributes. The M5P decision tree divides non-linear data relationships into leaves that are approximations of linear relationships in the form of linear expressions. Has the correlation coefficient improved over LinearRegression? How good is it? What do you notice about the multipliers in its linear formulas? What do you notice about the magnitude of values in its decision tree? Paste this part of its result.

Correlation coefficient          N.n
Mean absolute error              N.n
Root mean squared error          N.n
Relative absolute error          N.n %
Root relative squared error      N.n %
Total Number of Instances        10005
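The idea behind M5P's model tree -- split the instance space on an attribute value, then fit a linear model per leaf -- can be sketched with a single split (the split-at-median rule and V-shaped data below are illustrative simplifications, not M5P's actual split criterion):

```python
def piecewise_fit(xs, ys):
    """One-split model-tree sketch: split at the median x and fit a
    least-squares line on each side, one linear model per leaf."""
    def fit(pairs):
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        slope = (sum((x - mx) * (y - my) for x, y in pairs)
                 / sum((x - mx) ** 2 for x, _ in pairs))
        return slope, my - slope * mx

    pairs = sorted(zip(xs, ys))
    split = pairs[len(pairs) // 2][0]
    left  = [p for p in pairs if p[0] <  split]
    right = [p for p in pairs if p[0] >= split]
    return split, fit(left), fit(right)

# V-shaped data: one global line fits poorly, two leaf lines fit exactly
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 2.0, 1.0, 1.0, 2.0, 3.0]
split, left_model, right_model = piecewise_fit(xs, ys)
# split at x = 4.0; left leaf y = -1*x + 4, right leaf y = 1*x - 3
```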

First, the correlated non-target attribute values are very high, leading to the apparently small multipliers in the above linear expressions. Second, there are too many attributes. Remove all attributes but the selected ones plus the target, and fill the following out. How does CC compare to that of Q9?

What do you notice about the multipliers in its linear formulas? Paste this part of its result.

Correlation coefficient          N.n
Mean absolute error              N.n
Root mean squared error          N.n
Relative absolute error          N.n %
Root relative squared error      N.n %
Total Number of Instances        10005

with the default formula "(A-MIN)/(MAX-MIN)". Unlike AddExpression which creates

a new named derived attribute, MathExpression with default settings applies its formula

to every non-target attribute in the dataset. Unlike Assignment 1 where amplitudes

and frequencies were normalized against the amplitude and frequency of the

fundamental frequency, "(A-MIN)/(MAX-MIN)" normalizes each non-target attribute

as a fraction of its respective distance between that attribute's individual minimum and

maximum values. This is a within-attribute, across-all-instances normalization, not a

cross-attribute as in Assignment 1's dataset.
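The effect of MathExpression's default "(A-MIN)/(MAX-MIN)" formula can be sketched per attribute column (the attribute name and values below are made up):

```python
def min_max_normalize(column):
    """Weka's (A-MIN)/(MAX-MIN): map one attribute's values onto
    [0.0, 1.0] using that attribute's own minimum and maximum
    across all instances."""
    lo, hi = min(column), max(column)
    return [(a - lo) / (hi - lo) for a in column]

ampl7 = [120.0, 480.0, 300.0, 210.0]
print(min_max_normalize(ampl7))  # [0.0, 1.0, 0.5, 0.25]
```

Each column is scaled independently, which is why every non-target attribute ends up on the same [0.0, 1.0] range regardless of its original magnitude.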

Make sure that every non-target attribute has been mapped, linearly, into the range [0.0, 1.0], and that the target gain keeps its values in [0.5, 0.9]. We want predictions in the original gain range. (SciPy's fft function has a parameter called norm; changing it from its default "backward" to its "forward" value brings the range of histogram values down into a lower range that makes linear expression multipliers more visible in Weka without affecting accuracy of predictions, giving results similar to Weka's Normalize attribute filter. Here we normalize, not to make multipliers more visible in Weka, but to put attributes on the same within-attribute [0.0, 1.0] range so we can compare their multipliers for importance using a Weka filter.)
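The norm parameter's effect can be checked directly; a small demonstration (the tone parameters are arbitrary, and the exact peak values depend on signal length):

```python
import numpy as np
from scipy.fft import rfft

# a 1 kHz unit-amplitude sine, one second at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000 * t)

spec_backward = np.abs(rfft(signal))                  # default norm="backward"
spec_forward  = np.abs(rfft(signal, norm="forward"))  # scaled down by 1/N

print(spec_backward.max())  # N/2 = 22050 for a unit sine
print(spec_forward.max())   # 0.5: same shape, much lower range
```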

and turn it in to D2L with your README.txt file by the deadline.

Fill in the following output. How do CC, MAE, and RMSE compare to those of Q10? What do you notice about the multipliers in its linear formulas? Also, how do the MAE (mean absolute error) and RMSE (root mean squared error) measures compare to the earlier, pre-normalization runs? Paste this part of its result.

...
Correlation coefficient          N.n
Mean absolute error              N.n
Root mean squared error          N.n
Relative absolute error          N.n %
Root relative squared error      N.n %
Total Number of Instances        10005

Go into its configuration settings and change the indicated setting. How many rules are there, and how many linear formulas ("Number of Rules") in M5P for Q11? How does the complexity of Q11's and Q12's models compare? How do their CCs compare? Paste this part of its result.

Number of Rules : nn
...
Correlation coefficient          N.n
Mean absolute error              N.n
Root mean squared error          N.n
Relative absolute error          N.n %
Root relative squared error      N.n %
Total Number of Instances        10005
Number of training instances: 10005

Go into its configuration settings and increase the default value for minNumInstances, resulting in shallower trees that are possibly easier to read. How many linear formulas ("Number of Rules") are there in M5P for Q13 compared to Q11? How does the complexity of Q11's and Q13's models compare? How do their CCs compare? Paste this part of its result.

Number of Rules : n
...
Correlation coefficient          N.n
Mean absolute error              N.n
Root mean squared error          N.n
Relative absolute error          N.n %
Root relative squared error      N.n %
Total Number of Instances        10005

Open CSC458S24ClassifyAssn1Handout.arff and remove the tagged attributes other than gain, keeping the minNumInstances parameter set to the Q13 value. Then remove all attributes except the top 2 "Selected attributes" for Q4 plus gain. Run M5P again. What are their CC values? How do these CCs of M5P using CSC458S24ClassifyAssn1Handout.arff compare to the earlier ones? Why are these different? Think about the differences between the two datasets. Also, look at the attributes in Q14's decision trees compared to the earlier M5P trees using CSC458S24RegressAssn2Handout.arff's data.

Correlation coefficient N.n (for the 65-attribute dataset after removing the other tagged attributes)
Correlation coefficient N.n (for the 3-attribute dataset after removing all but 3 attributes)