CSC 523 - Advanced DataMine for Scientific Data Science, Fall 2023, M 6-8:45 PM, Old Main 158.

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8,
    referring to the data visualization figures below "Addendum 11/4/2023".

Assignment 3 Specification,
code is due by end of Monday November 20 via make turnitin on acad or mcgonagall.


Perform the following steps on acad or mcgonagall after logging into your account via putty or ssh:

cd                                    # places you into your login directory
mkdir DataMine                 # all of your csc523 projects go into this directory
cd  ./DataMine                 # makes DataMine your current working directory; it probably already exists
cp  ~parson/DataMine/CSC523f23AudioAssn3.problem.zip  CSC523f23AudioAssn3.problem.zip
unzip  CSC523f23AudioAssn3.problem.zip    # unzips your working copy of the project directory
cd  ./CSC523f23AudioAssn3                            # your project working directory

Perform all test execution on mcgonagall to avoid any platform-dependent output differences.
All input and output data files in Assignment 3 are small and reside in your project directory.
Here are the files of interest in this project directory. There are a few you can ignore.
Make sure to answer README.txt in your project directory. A missing README.txt incurs a late charge.
 
The application domain reference for Assignment 3 is here:

https://faculty.kutztown.edu/parson/spring2020/CSC558Audio1_2020.html

CSC523f23AudioAssn3_generator.py     # your work goes here, analyzing correlation coefficients and kappa for regressors & classifiers
CSC523f23AudioAssn3_main.py          # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
makelib                              # my library for the makefile
csc523fa2023AudioHarmonicData_32.csv.gz and AmplAvg128.csv.gz are the two input data files.
csc523fa2023AudioHarmonicData_32.csv.gz has ordered attributes ampl1 and freq1, which are the amplitude and frequency
    of the fundamental frequency, normalized to 1.0; ampl2 and freq2 through ampl32 and freq32 are the fractional amplitudes
    and frequency multiples relative to ampl1 and freq1 respectively, as extracted by extractAudioFreqARFF17Oct2023.py. The _32 refers to
    aggregating 22,050 discrete frequency histogram bins spanning 0 through 22,050 cycles per second (hertz) into 32 histogram bins.
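The binning step can be sketched as follows. This is a hedged approximation for illustration only, not the actual logic of extractAudioFreqARFF17Oct2023.py; the chunking rule (ceiling-sized chunks, with a shorter final chunk) is an assumption:

```python
# Hypothetical sketch of aggregating a fine frequency histogram into 32 bins.
# The real extractAudioFreqARFF17Oct2023.py aggregation is not reproduced
# here; the chunking rule below is an assumption for illustration only.
def aggregate(hist, nbins=32):
    size = -(-len(hist) // nbins)   # ceiling division: source bins per chunk
    return [sum(hist[i*size:(i+1)*size]) for i in range(nbins)]

# aggregate(list(range(8)), nbins=4) -> [1, 5, 9, 13]
```

With 22,050 source bins and nbins=32, each output bin sums 690 consecutive source bins, with the last chunk absorbing the remainder.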

fig 0

Figure 0: First 10 rows of 10,005 from csc523fa2023AudioHarmonicData_32.csv.gz; note the frequency distributions of square etc.
"A square wave consists of a fundamental sine wave (of the same frequency as the square wave) and odd harmonics of the fundamental."

AmplAvg128.csv.gz aggregates all amplitudes of 128 histogram bins of the same data into these attributes:

        MeanAmpl    mean of all frequency domain amplitudes
        Median      similar for Median, Population Standard Deviation, Min, and Max
        PStdev
        Min
        Max
        MeanLog     log10() of the above measures, compressing data as seen in Figure 1 below
        MedLog
        SdLog
        MinLog
        MaxLog
        MeanSqr     squaring (**2) of the above measures, as seen in Figure 1 below
        MedSqr
        SdSqr
        MinSqr
        MaxSqr
        tnoign      white noise gain in [0.1, 0.25] on a scale of [0.0, 1.0]
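These derived summary attributes can be illustrated with Python's statistics module. This is a hedged sketch over stand-in amplitude values, not the handout's extraction code:

```python
# Sketch of the AmplAvg128 summary attributes over stand-in amplitude bins.
# The actual extraction code is not reproduced; values are illustrative only.
from statistics import mean, median, pstdev
from math import log10

ampls = [10.0, 100.0, 1000.0, 10000.0]      # stand-in amplitude bins

MeanAmpl = mean(ampls)                      # 2777.5
Median   = median(ampls)                    # 550.0
PStdev   = pstdev(ampls)                    # population standard deviation
Min, Max = min(ampls), max(ampls)           # 10.0, 10000.0

MeanLog  = log10(MeanAmpl)                  # log10 compresses the range
MeanSqr  = MeanAmpl ** 2                    # squaring expands the range
```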

As usual, make clean test tests your code and make turnitin turns it in to me by the due date.
There is the usual 10% per-day late charge after the deadline. Make sure to turn in README.txt.

We will go over this Monday October 30 and at least half of the November 6 class will be work time.

Half of your points are for coding the STUDENT requirements in CSC523f23AudioAssn3_generator.py
and half are for answers in README.txt. Make sure to answer README.txt in your project directory.
A missing README.txt incurs a late charge.
    *******************************************************************
    # STUDENT 1: 20%, Read & store data sets from CSV files.
    # FOR all file names in sorted(openFileSet)
    #   IF the file name endswith '.gz' (use str.endswith(...))
    #       filehandle = gzip.open(the file name, 'rt') # use 'rt'
    #       https://docs.python.org/3.7/library/gzip.html
    #   ELSE
    #       filehandle = open(the file name, 'r')
    #   filecsv = csv.reader(filehandle)
    #   LOAD the data set in filecsv into a list of rows, where each
    #       row is a list of cells within that row, where each cell
    #       has been translated via convert(cell) into a float if it
    #       is one, else it remains a string. See def convert() above.
    #       This could be a nested for loop or a nested list comprehension.
    #   IF the file name startswith 'AmplAvg128.csv'
    #       Call PrintCCstats(ccstats, file name, data set header row[0],
    #           remaining data set rows [1:], 'tnoign')
    #   FOR all data keys in inputCSVmap.keys():
    #       IF inputCSVmap[data key][0] == file name
    #           inputCSVmap[data key][0] = data set, where row[0] is the header
    #       IF inputCSVmap[data key][1] == file name
    #           inputCSVmap[data key][1] = data set, where row[1:] is the data
    #   CLOSE THE filehandle (last line within scope of FOR all file names ...)
    # CLOSE ccstats AFTER (not within) FOR all file names in sorted(...)
    #   # Comment: row[0] of data set is header, row[1:] of data set is data
    pass    # STUDENT 1 code starts on the next line.
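The STUDENT 1 loading pattern can be sketched as below. convert() here is only an approximation of the handout's def convert() (which is the authority), and the inputCSVmap/ccstats bookkeeping steps are omitted:

```python
# Hedged sketch of the STUDENT 1 file-loading pattern. convert() is an
# approximation of the handout's convert(); inputCSVmap and ccstats
# bookkeeping are omitted.
import csv
import gzip

def convert(cell):
    '''Return a float if the cell parses as one, else the original string.'''
    try:
        return float(cell)
    except ValueError:
        return cell

def loadCSV(fname):
    '''Load a possibly-gzipped CSV; row[0] is the header, row[1:] the data.'''
    if fname.endswith('.gz'):
        filehandle = gzip.open(fname, 'rt')   # 'rt' yields text, not bytes
    else:
        filehandle = open(fname, 'r')
    filecsv = csv.reader(filehandle)
    dataset = [[convert(cell) for cell in row] for row in filecsv]
    filehandle.close()
    return dataset
```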

    # STUDENT 2: 10%, Update inputCSVmap[data key][4] with getData 6-tuple
    # FOR all data keys in inputCSVmap.keys():
    #   sixTuple = getData(inputCSVmap[data key][TRAINING DATA],
    #       inputCSVmap[data key][TESTING DATA],
    #       inputCSVmap[data key][TARGET ATTRIBUTE NAME],
    #       inputCSVmap[data key][list of attributes to discard])
    #   inputCSVmap[data key][4] = sixTuple
    pass    # STUDENT 2 code starts on the next line.
    # STUDENT 3: 10% Make shuffled copy of the big 10K
    # CREATE a variable big10Kshuffled consisting of
    #   row[0] of inputCSVmap['big10K'][TRAINING DATA] and a copy of
    #   row[1:] of inputCSVmap['big10K'][TRAINING DATA] passed through
    #   shuffle() with random_state=220223523
    # CREATE a variable big10Kshuffle6Tuple, passing big10Kshuffled as
    #   the first two arguments (TRAINING and TESTING data) and the
    #   inputCSVmap['big10K'][TARGET ATTRIBUTE NAME],
    #   inputCSVmap['big10K'][list of attributes to discard] as arguments
    #   to getData(), storing its return 6-tuple in big10Kshuffle6Tuple
    pass    # STUDENT 3 code starts on the next line.
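The shuffle() named in STUDENT 3 takes a random_state= keyword, matching sklearn.utils.shuffle's signature. A minimal seeded stand-in, assuming that behavior, looks like this:

```python
# Minimal seeded stand-in for the shuffle(rows, random_state=...) call in
# STUDENT 3; the handout presumably uses sklearn.utils.shuffle.
import random

def shuffle(rows, random_state=None):
    rows = list(rows)                       # copy so the original is intact
    random.Random(random_state).shuffle(rows)
    return rows

header = ['Max', 'tnoign']                  # stand-in header and data rows
data = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.15]]
# keep row[0] (header) in place, shuffle only the data rows
big10Kshuffled = [header] + shuffle(data, random_state=220223523)
```

The same random_state yields the same ordering every run, which is what makes the shuffled result reproducible for grading.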

    # STUDENT 4: 10% Make copy of inputCSVmap['ampl'] with MEDIAN,MIN,target
    # RETRIEVE the header from inputCSVmap['ampl'][0] row[0]
    # RETRIEVE the target name from inputCSVmap['ampl'][2]
    # MAKE a new header list: ['Median', 'Min',  target name]
    # MAKE a NOT header list of every attribute name in the retrieved header
    #   that is NOT in the new header list
    # Call getData(inputCSVmap['ampl'][TRAINING DATA],
    #   inputCSVmap['ampl'][TESTING DATA], target name,
    #   NOT header list to discard) store returned 6-tuple from getData() in
    #   amplMedianMin6Tuple.
    #   REMAINING STUDENT 5 through 9 WORK IS IN README.txt.
    pass    # STUDENT 4 code starts on the next line.
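The NOT-header step of STUDENT 4 is a one-line list comprehension. A sketch over a stand-in subset of the AmplAvg128 attribute names:

```python
# Sketch of building STUDENT 4's discard list. The header here is a
# stand-in subset of the AmplAvg128 attribute names.
header = ['MeanAmpl', 'Median', 'PStdev', 'Min', 'Max', 'tnoign']
target = 'tnoign'
newHeader = ['Median', 'Min', target]
notHeader = [name for name in header if name not in newHeader]
# notHeader -> ['MeanAmpl', 'PStdev', 'Max']
```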

The other half of your points are for answers in README.txt:

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 5 10%: Consult the correlation coefficients for tnoign from
    AmplAvg128.csv.gz in file CSC523f23AudioCCs.ref.
Consult these two decision tree structures in CSC523f23AudioStructured.ref

DATA 5 ampl REGRESSOR decisionTreeRegressor TRAIN # 5002 TEST # 5003
    CC 0.960595 RMSQE 0.012141 MABSE 0.009578 AMN 0.10 PMN 0.11
    AMX 0.25 PMX 0.23 AVG 0.18 PVG 0.17 AMD 0.18 PMD 0.18 ASD 0.04 PSD 0.04
22 LINES IN REGRESSOR TREE
tnoign =
    DECISION TREE PRINT OUT

DATA 6 amplMedianMin REGRESSOR decisionTreeRegressor TRAIN # 5002 TEST # 5003
    CC 0.960595 RMSQE 0.012141 MABSE 0.009578 AMN 0.10 PMN 0.11
    AMX 0.25 PMX 0.23 AVG 0.18 PVG 0.17 AMD 0.18 PMD 0.18 ASD 0.04 PSD 0.04
22 LINES IN REGRESSOR TREE
tnoign =
    DECISION TREE PRINT OUT

Note their respective correlation coefficient values, root mean squared error,
and mean absolute error measures. Also note this model constructor from
CSC523f23AudioAssn3_generator.py:
    decisionTreeRegressor = DecisionTreeRegressor(min_samples_split=1000,
        random_state=220223523)
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

How do the attributes in these decision trees in CSC523f23AudioStructured.ref
relate to the CCs of attributes in CSC523f23AudioCCs.ref? What effect
does the reduction of attributes down to [MEDIAN, MIN, tnoign] have
on the CC and error measures of DATA 6's tree? Why do you think that is?

STUDENT ANSWER:

*******************************************************************
STUDENT 6 10%: What effect does the parameter min_samples_split=1000
in the DecisionTreeRegressor constructor call of STUDENT 5 have on
the tree structures in STUDENT 5? EXPERIMENT by setting
min_samples_split=2, which is its default, and examining the resulting
DATA 5 and DATA 6 CCs and tree structures in CSC523f23AudioStructured.tmp
(ignore the diff). Do the correlation coefficients improve or degrade in
going to a deeper tree? Does human intelligibility improve or degrade
in going to a deeper tree? (Make sure to restore min_samples_split=1000
and get "make test" to pass again.)

STUDENT ANSWER:

*******************************************************************
SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 7 10%: Examine Figure 1 in the assignment handout and
the Data 2 and 3 CCs and Linear Regression formulas of
CSC523f23AudioStructured.ref. Why do you think the DATA 3 tree has a CC
that is within 1.2% of Data 2's value, given the loss of most of the
attributes? How does going from the linear regression formula of
DATA 2 to DATA 3 relate to MDL (Minimum Description Length), given
the fact that the reduction in CC accuracy is much less than my 10%
threshold for MDL?
In [2]: (0.974525-0.963444)/0.974525
Out[2]: 0.01137066776121701
Consulting CSC523f23AudioCCs.ref may also help with this answer.
Make sure to refer to Figure 1 in your answer.

fig1

STUDENT ANSWER:

*******************************************************************

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 8 10%: Look at these frequency domain plots in
https://faculty.kutztown.edu/parson/spring2020/CSC558Audio1_2020.html.

lazy1_SinOsc_1000_0.9_0.0_0.FREQ.png
lazy1_TriOsc_1000_0.9_0.0_0.FREQ.png
lazy1_SqrOsc_1000_0.9_0.0_0.FREQ.png
lazy1_SawOsc_1000_0.9_0.0_0.FREQ.png
lazy1_PulseOsc_1000_0.9_0.0_0.FREQ.png

lazy1_SinOsc_1001_0.500235007566_0.139453694281_615143.FREQ.png
lazy1_TriOsc_1000_0.550165803172_0.187167353289_513021.FREQ.png
lazy1_SqrOsc_1000_0.743147025456_0.162010730076_822957.FREQ.png
lazy1_SawOsc_1001_0.661812948435_0.18926762076_534545.FREQ.png
lazy1_PulseOsc_1000_0.719825313513_0.153077641397_210081.FREQ.png

The first five in this question are reference waveforms with 0.0
tnoign white noise gain, while the last five have tnoign values ranging
from 0.139453694281 to 0.18926762076. Given the difference
in these waveform plots between tnoign=0.0 and tnoign in the range
0.1 to 0.25, why do you think Median and Min are more closely correlated
with tnoign level than Mean, Standard Deviation, or Max?
Put another way, why is CSC523f23AudioCCs.ref ordered the
way it is? Why do Median and Min sit at the top of its CC ordering?

STUDENT ANSWER:

*******************************************************************
STUDENT 9 10%: CSC523f23AudioAssn3.sorted.ref shows the following:
DATA 17 big10Kshuffled CLASSIFIER decisionTreeClassifier
    TRAIN # 5002 TEST # 5003 kappa 1.000000 Correct 5003 %correct 1.000000
DATA 16 big10K CLASSIFIER decisionTreeClassifier
    TRAIN # 5002 TEST # 5003 kappa 0.375562 Correct 3003 %correct 0.600240

Why does DATA 17 big10Kshuffled have a kappa of 1.000000 while
DATA 16 big10K has a kappa of only 0.375562 for classifying toosc
(type of signal oscillator), given the fact that they use the same data?
Your answer must include WHY the difference in these two datasets had
this effect on kappa, and not just what your code did to transform the data.
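As background for interpreting these kappa values (not the answer to the question itself), Cohen's kappa measures agreement corrected for chance agreement. A from-scratch sketch, rather than the sklearn metric the handout presumably uses:

```python
# From-scratch Cohen's kappa: observed agreement corrected for the
# agreement expected by chance. Illustrative only; the handout presumably
# computes kappa via sklearn.
from collections import Counter

def cohen_kappa(actual, predicted):
    n = len(actual)
    po = sum(a == p for a, p in zip(actual, predicted)) / n   # observed
    ca, cp = Counter(actual), Counter(predicted)
    pe = sum(ca[c] * cp[c] for c in ca) / (n * n)             # chance
    return (po - pe) / (1 - pe)

actual    = ['SinOsc', 'SqrOsc', 'SinOsc', 'SqrOsc']
perfect   = list(actual)                        # kappa = 1.0
chancelvl = ['SinOsc', 'SinOsc', 'SqrOsc', 'SqrOsc']  # kappa = 0.0
```

A classifier that is right half the time on a balanced two-class problem scores kappa near 0, which is why kappa is a harsher measure than raw %correct.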

STUDENT ANSWER:

*******************************************************************


Addendum 11/4/2023:

AmplAvg128MDLgraph.jpg
Line graphs of Mean, Median, Min, tnoign, and linear model mdl_tnoign with Max along the X axis.

Spike in Max and Mean is at row 7882.
row[N] Max MeanAmpl Median Min tnoign mdl_tnoign
7870 25042437.3 950297.217 699325.655 126909.654 0.21434862 0.20536397
7871 25655746 892270.739 629969.153 90295.4951 0.19599477 0.18599655
7872 26896844.8 749471.748 433803.632 70306.3422 0.13951522 0.13118066
7873 27172923.5 1048322.34 751396.245 91406.4197 0.20567984 0.21993286
7874 27498089.1 798744.191 492841.797 55728.3224 0.1287018 0.14768741
7875 27548207.5 600249.823 371114.463 49482.4463 0.10064269 0.11366952
7876 27604606.9 896487.674 645538.965 28632.0212 0.18691958 0.19037592
7877 28734832.7 1011017.89 753981.969 90616.2651 0.2383981 0.22065588
7878 28869365.8 1015829.92 763037.123 45470.2699 0.21298605 0.22320705
7879 34374944.0 965978.908 658009.179 87987.6064 0.1966708 0.19383431
7880 36893096.0 1079560.63 719765.747 120632.259 0.22875524 0.21107946
7881 40411706.0 1145769.33 854662.656 24356.1706 0.23802755 0.24882437
7882 233107488.0 2624774.41 765606.969 40333.6345 0.23489374 0.2239276
7883 233609256.0 2376112.38 515508.328 21179.7309 0.14997846 0.15403794
7884 245015009.0 2665982.25 675460.05 128038.695 0.19428094 0.19869343
7885 249942747.0 2699865.79 699967.241 107830.465 0.2079127 0.2055519
7886 250068341.0 2536287.11 541088.737 45558.9197 0.15809185 0.16117621
7887 251239967.0 2476912.57 450026.645 11679.147 0.15200352 0.13574122
7888 255681455.0 2857093.87 771225.557 48129.5301 0.2384701 0.22549438
7889 256187276.0 2847472.89 770235.874 84439.4073 0.23796245 0.22520136
7890 256839903.0 2878113.51 811501.204 58980.1562 0.22540067 0.23674583

There is no obvious reason for the spike in the toscgn and tnoign parameters of the corresponding .wav files:

 
7870   Path2Waves + "lazy1_SqrOsc_883_0.5019196902857631_0.13410920677627836_774888.wav",
 7871   Path2Waves + "lazy1_SqrOsc_885_0.5040248778513396_0.23435531677370172_451831.wav",
 7872   Path2Waves + "lazy1_SqrOsc_886_0.6761024165181524_0.20043111620802856_524410.wav",
 7873   Path2Waves + "lazy1_SqrOsc_887_0.6174319802004277_0.12557687468457596_950968.wav",
 7874   Path2Waves + "lazy1_SqrOsc_888_0.5238622974154334_0.15147334994900363_804612.wav",
 7875   Path2Waves + "lazy1_SqrOsc_888_0.6418546582919902_0.19084791411015625_634933.wav",
 7876   Path2Waves + "lazy1_SqrOsc_889_0.5362953028226431_0.1762284617042985_504835.wav",
 7877   Path2Waves + "lazy1_SqrOsc_892_0.5235271521310194_0.2338311279069861_636038.wav",
 7878   Path2Waves + "lazy1_SqrOsc_892_0.6646530432416745_0.22680253281084703_328132.wav",
 7879   Path2Waves + "lazy1_SqrOsc_892_0.7217593009337524_0.19029282399381536_818824.wav",
 7880   Path2Waves + "lazy1_SqrOsc_892_0.7293327302121537_0.14308256163151722_400709.wav",
 7881   Path2Waves + "lazy1_SqrOsc_893_0.6358610041406884_0.11435598256117246_617656.wav",
 7882   Path2Waves + "lazy1_SqrOsc_899_0.566973563640619_0.23048518514347255_444139.wav",
 7883   Path2Waves + "lazy1_SqrOsc_900_0.6853473203356168_0.10706930783607058_460924.wav",
 7884   Path2Waves + "lazy1_SqrOsc_901_0.6473665158447626_0.16574463090753014_590583.wav",
 7885   Path2Waves + "lazy1_SqrOsc_901_0.7265478986563011_0.23435666926403104_320089.wav",
 7886   Path2Waves + "lazy1_SqrOsc_902_0.6770973417562448_0.23089770748081634_526051.wav",
 7887   Path2Waves + "lazy1_SqrOsc_902_0.7348565888313343_0.19430357642905632_822145.wav",
 7888   Path2Waves + "lazy1_SqrOsc_904_0.7160828135895405_0.17943542725340125_945078.wav",
 7889   Path2Waves + "lazy1_SqrOsc_905_0.7067836319127334_0.1194957976001336_402210.wav",
 7890   Path2Waves + "lazy1_SqrOsc_906_0.6227309699714663_0.24224402408964366_375083.wav",

Pseudo-random signal levels from https://faculty.kutztown.edu/parson/spring2020/genwav.py.txt

These calls to random.randint and random.uniform:

            freq = random.randint(100,2000)
            gainosc = random.uniform(.5, .75)
            gainnoise = random.uniform(.1, .25)

are giving a non-uniform, monotonically increasing distribution of frequencies for the last 833 SqrOsc entries below.
The signal and white noise gains from random.uniform appear to be fluctuating uniformly.

  1168  2000_0.7316099204127263_0.1345648623275206
  1169  200_0.575054908577552_0.16707096720619902
  1170  200_0.5780057821133431_0.16775737166844037
  1171  200_0.6911696008534549_0.10813990571092388
  1172  201_0.7283323250127087_0.12307749549715327
...
  1876  892_0.7293327302121537_0.14308256163151722
  1877  893_0.6358610041406884_0.11435598256117246
  1878  899_0.566973563640619_0.23048518514347255
  1879  900_0.6853473203356168_0.10706930783607058
  1880  901_0.6473665158447626_0.16574463090753014
...
  1997  997_0.6114215185047296_0.17181569951464715
  1998  998_0.5794539349004512_0.15945904197023156
  1999  998_0.6929647312679408_0.15457440558587582
  2000  998_0.7197724877009135_0.22186728937530703
  2001  999_0.5624164158552742_0.22045281358309338

Correlation coefficient for mdl_tnoign to tnoign confirmed in AmplAvg128MDLgraph.xlsx.

A high CC does not mean the actual and predicted values are EQUAL:

In [1]: from scipy.stats import pearsonr                                       

In [2]: from math import sqrt                                                  

In [9]: actual = [float(i) for i in range(1, 10002, 100)]                      

In [10]: len(actual)                                                           
Out[10]: 101

In [11]: predicted = [value * 100 for value in actual]                         

In [12]: CC = pearsonr(actual, predicted)                                      

In [13]: CC                                                                    
Out[13]: (0.9999999999999998, 0.0)

In [15]: from statistics  import mean                                          

In [16]: mabse = mean([abs(actual[ix]-predicted[ix]) for ix in range(0,len(actual))])                   

In [17]: mabse                                                                                          
Out[17]: 495099.0

In [18]: predicted[50]                                                                                  
Out[18]: 500100.0

In [19]: actual[50]                                                                                     
Out[19]: 5001.0

In [20]: predicted[50]-actual[50]                                                                       
Out[20]: 495099.0

In [21]: from math import sqrt                                                                          

In [23]: rmsqe = sqrt(mean([((actual[ix]-predicted[ix])**2) for ix in range(0,len(actual))]))           

In [24]: rmsqe                                                                                          
Out[24]: 573089.4518319108

Parallel Curves in Value Variation == High Correlation Coefficient

MedianMinTnoign.jpg

Median, Min, and tnoign for first 100 data rows of AmplAvg128.

lazy1_SqrOsc_1000_0.9_0.0_0.overlay.jpg


lazy1_SqrOsc_1000_0.743147025456_0.162010730076_822957_overlay.jpg