CSC 523 - Advanced DataMine for Scientific Data Science, Fall 2023, M 6-8:45 PM, Old Main 158.

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8,
    referring to the data visualization figures below "Addendum 11/4/2023".

Assignment 3 Specification,
code is due by end of Monday November 20 via make turnitin on acad or mcgonagall.


Perform the following steps on acad or mcgonagall after logging into your account via putty or ssh:

cd                                    # places you into your login directory
mkdir DataMine                 # all of your csc523 projects go into this directory
cd  ./DataMine                 # makes DataMine your current working directory; it probably already exists
cp  ~parson/DataMine/CSC523f23AudioAssn3.problem.zip  CSC523f23AudioAssn3.problem.zip
unzip  CSC523f23AudioAssn3.problem.zip    # unzips your working copy of the project directory
cd  ./CSC523f23AudioAssn3                            # your project working directory

Perform all test execution on mcgonagall to avoid any platform-dependent output differences.
All input and output data files in Assignment 3 are small and reside in your project directory.
Here are the files of interest in this project directory. There are a few you can ignore.
Make sure to answer README.txt in your project directory. A missing README.txt incurs a late charge.
 
The application domain reference for Assignment 3 is here:

https://faculty.kutztown.edu/parson/spring2020/CSC558Audio1_2020.html

CSC523f23AudioAssn3_generator.py     # your work goes here, analyzing correlation coefficients and kappa for regressors & classifiers
CSC523f23AudioAssn3_main.py          # Parson's handout code for building & testing models that your generator above provides
makefile                             # the Linux make utility uses this script to direct testing & data viz graphing actions
makelib                              # my library for the makefile
csc523fa2023AudioHarmonicData_32.csv.gz and AmplAvg128.csv.gz are the two input data files.
csc523fa2023AudioHarmonicData_32.csv.gz has ordered attributes ampl1 and freq1, which are the amplitude and frequency
    of the fundamental frequency, normalized to 1.0; ampl2 and freq2 through ampl32 and freq32 are the fractional amplitudes
    and frequency multiples relative to ampl1 and freq1 respectively, as extracted by extractAudioFreqARFF17Oct2023.py. The _32 refers to
    aggregating 22,050 discrete frequency histogram bins spanning 0 through 22,050 cycles per second (hertz) into 32 histogram bins.
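The binning step can be sketched as follows. This is a hedged approximation for illustration only, not the actual logic of extractAudioFreqARFF17Oct2023.py; the chunking rule (ceiling-sized chunks, with a shorter final chunk) is an assumption:

```python
# Hypothetical sketch of aggregating a fine frequency histogram into 32 bins.
# The real extractAudioFreqARFF17Oct2023.py aggregation is not reproduced
# here; the chunking rule below is an assumption for illustration only.
def aggregate(hist, nbins=32):
    size = -(-len(hist) // nbins)   # ceiling division: source bins per chunk
    return [sum(hist[i*size:(i+1)*size]) for i in range(nbins)]

# aggregate(list(range(8)), nbins=4) -> [1, 5, 9, 13]
```

With 22,050 source bins and nbins=32, each output bin sums 690 consecutive source bins, with the last chunk absorbing the remainder.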

fig 0

Figure 0: First 10 rows of 10,005 from csc523fa2023AudioHarmonicData_32.csv.gz; note the frequency distributions of square etc.
"A square wave consists of a fundamental sine wave (of the same frequency as the square wave) and odd harmonics of the fundamental."

AmplAvg128.csv.gz aggregates all amplitudes of 128 histogram bins of the same data into these attributes:

        MeanAmpl    mean of all frequency domain amplitudes
        Median      similar for Median, Population Standard Deviation, Min, and Max
        PStdev
        Min
        Max
        MeanLog     log10() of the above measures, compressing data as seen in Figure 1 below
        MedLog
        SdLog
        MinLog
        MaxLog
        MeanSqr     squaring (**2) of the above measures, as seen in Figure 1 below
        MedSqr
        SdSqr
        MinSqr
        MaxSqr
        tnoign      white noise gain in [0.1, 0.25] on a scale of [0.0, 1.0]
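These derived summary attributes can be illustrated with Python's statistics module. This is a hedged sketch over stand-in amplitude values, not the handout's extraction code:

```python
# Sketch of the AmplAvg128 summary attributes over stand-in amplitude bins.
# The actual extraction code is not reproduced; values are illustrative only.
from statistics import mean, median, pstdev
from math import log10

ampls = [10.0, 100.0, 1000.0, 10000.0]      # stand-in amplitude bins

MeanAmpl = mean(ampls)                      # 2777.5
Median   = median(ampls)                    # 550.0
PStdev   = pstdev(ampls)                    # population standard deviation
Min, Max = min(ampls), max(ampls)           # 10.0, 10000.0

MeanLog  = log10(MeanAmpl)                  # log10 compresses the range
MeanSqr  = MeanAmpl ** 2                    # squaring expands the range
```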

As usual, make clean test tests your code and make turnitin turns it in to me by the due date.
There is the usual 10% per-day late charge after the deadline. Make sure to turn in README.txt.

We will go over this Monday October 30 and at least half of the November 6 class will be work time.

Half of your points are for coding the STUDENT requirements in CSC523f23AudioAssn3_generator.py
and half are for answers in README.txt. Make sure to answer README.txt in your project directory.
A missing README.txt incurs a late charge.
    *******************************************************************
    # STUDENT 1: 20%, Read & store data sets from CSV files.
    # FOR all file names in sorted(openFileSet)
    #   IF the file name endswith '.gz' (use str.endswith(...))
    #       filehandle = gzip.open(the file name, 'rt') # use 'rt'
    #       https://docs.python.org/3.7/library/gzip.html
    #   ELSE
    #       filehandle = open(the file name, 'r')
    #   filecsv = csv.reader(filehandle)
    #   LOAD the data set in filecsv into a list of rows, where each
    #       row is a list of cells within that row, where each cell
    #       has been translated via convert(cell) into a float if it
    #       is one, else it remains a string. See def convert() above.
    #       This could be a nested for loop or a nested list comprehension.
    #   IF the file name startswith 'AmplAvg128.csv'
    #       Call PrintCCstats(ccstats, file name, data set header row[0],
    #           remaining data set rows [1:], 'tnoign')
    #   FOR all data keys in inputCSVmap.keys():
    #       IF inputCSVmap[data key][0] == file name
    #           inputCSVmap[data key][0] = data set, where row[0] is the header
    #       IF inputCSVmap[data key][1] == file name
    #           inputCSVmap[data key][1] = data set, where row[1:] is the data
    #   CLOSE THE filehandle (last line within scope of FOR all file names ...)
    # CLOSE ccstats AFTER (not within) FOR all file names in sorted(...)
    #   # Comment: row[0] of data set is header, row[1:] of data set is data
    pass    # STUDENT 1 code starts on the next line.
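The STUDENT 1 loading pattern can be sketched as below. convert() here is only an approximation of the handout's def convert() (which is the authority), and the inputCSVmap/ccstats bookkeeping steps are omitted:

```python
# Hedged sketch of the STUDENT 1 file-loading pattern. convert() is an
# approximation of the handout's convert(); inputCSVmap and ccstats
# bookkeeping are omitted.
import csv
import gzip

def convert(cell):
    '''Return a float if the cell parses as one, else the original string.'''
    try:
        return float(cell)
    except ValueError:
        return cell

def loadCSV(fname):
    '''Load a possibly-gzipped CSV; row[0] is the header, row[1:] the data.'''
    if fname.endswith('.gz'):
        filehandle = gzip.open(fname, 'rt')   # 'rt' yields text, not bytes
    else:
        filehandle = open(fname, 'r')
    filecsv = csv.reader(filehandle)
    dataset = [[convert(cell) for cell in row] for row in filecsv]
    filehandle.close()
    return dataset
```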

    # STUDENT 2: 10%, Update inputCSVmap[data key][4] with getData 6-tuple
    # FOR all data keys in inputCSVmap.keys():
    #   sixTuple = getData(inputCSVmap[data key][TRAINING DATA],
    #       inputCSVmap[data key][TESTING DATA],
    #       inputCSVmap[data key][TARGET ATTRIBUTE NAME],
    #       inputCSVmap[data key][list of attributes to discard])
    #   inputCSVmap[data key][4] = sixTuple
    pass    # STUDENT 2 code starts on the next line.
    # STUDENT 3: 10% Make shuffled copy of the big 10K
    # CREATE a variable big10Kshuffled consisting of
    #   row[0] of inputCSVmap['big10K'][TRAINING DATA] and a copy of
    #   row[1:] of inputCSVmap['big10K'][TRAINING DATA] passed through
    #   shuffle() with random_state=220223523
    # CREATE a variable big10Kshuffle6Tuple, passing big10Kshuffled as
    #   the first two arguments (TRAINING and TESTING data) and the
    #   inputCSVmap['big10K'][TARGET ATTRIBUTE NAME],
    #   inputCSVmap['big10K'][list of attributes to discard] as arguments
    #   to getData(), storing its return 6-tuple in big10Kshuffle6Tuple
    pass    # STUDENT 3 code starts on the next line.
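The shuffle() named in STUDENT 3 takes a random_state= keyword, matching sklearn.utils.shuffle's signature. A minimal seeded stand-in, assuming that behavior, looks like this:

```python
# Minimal seeded stand-in for the shuffle(rows, random_state=...) call in
# STUDENT 3; the handout presumably uses sklearn.utils.shuffle.
import random

def shuffle(rows, random_state=None):
    rows = list(rows)                       # copy so the original is intact
    random.Random(random_state).shuffle(rows)
    return rows

header = ['Max', 'tnoign']                  # stand-in header and data rows
data = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.15]]
# keep row[0] (header) in place, shuffle only the data rows
big10Kshuffled = [header] + shuffle(data, random_state=220223523)
```

The same random_state yields the same ordering every run, which is what makes the shuffled result reproducible for grading.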

    # STUDENT 4: 10% Make copy of inputCSVmap['ampl'] with MEDIAN,MIN,target
    # RETRIEVE the header from inputCSVmap['ampl'][0] row[0]
    # RETRIEVE the target name from inputCSVmap['ampl'][2]
    # MAKE a new header list: ['Median', 'Min',  target name]
    # MAKE a NOT header list of every attribute name in the retrieved header
    #   that is NOT in the new header list
    # Call getData(inputCSVmap['ampl'][TRAINING DATA],
    #   inputCSVmap['ampl'][TESTING DATA], target name,
    #   NOT header list to discard) store returned 6-tuple from getData() in
    #   amplMedianMin6Tuple.
    #   REMAINING STUDENT 5 through 9 WORK IS IN README.txt.
    pass    # STUDENT 4 code starts on the next line.
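The NOT-header step of STUDENT 4 is a one-line list comprehension. A sketch over a stand-in subset of the AmplAvg128 attribute names:

```python
# Sketch of building STUDENT 4's discard list. The header here is a
# stand-in subset of the AmplAvg128 attribute names.
header = ['MeanAmpl', 'Median', 'PStdev', 'Min', 'Max', 'tnoign']
target = 'tnoign'
newHeader = ['Median', 'Min', target]
notHeader = [name for name in header if name not in newHeader]
# notHeader -> ['MeanAmpl', 'PStdev', 'Max']
```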

The other half of your points are for answers in README.txt:

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 5 10%: Consult the correlation coefficients for tnoign from
    AmplAvg128.csv.gz in file CSC523f23AudioCCs.ref.
Consult these two decision tree structures in CSC523f23AudioStructured.ref

DATA 5 ampl REGRESSOR decisionTreeRegressor TRAIN # 5002 TEST # 5003
    CC 0.960595 RMSQE 0.012141 MABSE 0.009578 AMN 0.10 PMN 0.11
    AMX 0.25 PMX 0.23 AVG 0.18 PVG 0.17 AMD 0.18 PMD 0.18 ASD 0.04 PSD 0.04
22 LINES IN REGRESSOR TREE
tnoign =
    DECISION TREE PRINT OUT

DATA 6 amplMedianMin REGRESSOR decisionTreeRegressor TRAIN # 5002 TEST # 5003
    CC 0.960595 RMSQE 0.012141 MABSE 0.009578 AMN 0.10 PMN 0.11
    AMX 0.25 PMX 0.23 AVG 0.18 PVG 0.17 AMD 0.18 PMD 0.18 ASD 0.04 PSD 0.04
22 LINES IN REGRESSOR TREE
tnoign =
    DECISION TREE PRINT OUT

Note their respective correlation coefficient values, root mean squared error,
and mean absolute error measures. Also note this model constructor from
CSC523f23AudioAssn3_generator.py:
    decisionTreeRegressor = DecisionTreeRegressor(min_samples_split=1000,
        random_state=220223523)
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

How do the attributes in these decision trees in CSC523f23AudioStructured.ref
relate to the CCs of attributes in CSC523f23AudioCCs.ref? What effect
does the reduction of attributes down to [MEDIAN, MIN, tnoign] have
on the CC and error measures of DATA 6's tree? Why do you think that is?

STUDENT ANSWER:

*******************************************************************
STUDENT 6 10%: What effect does the parameter min_samples_split=1000
in the DecisionTreeRegressor constructor call of STUDENT 5 have on
the tree structures in STUDENT 5? EXPERIMENT by setting
min_samples_split=2, which is its default, and examining the resulting
DATA 5 and DATA 6 CCs and tree structures in CSC523f23AudioStructured.tmp
(ignore the diff). Do the correlation coefficients improve or degrade in
going to a deeper tree? Does human intelligibility improve or degrade
in going to a deeper tree? (Make sure to restore min_samples_split=1000
and get "make test" to pass again.)

STUDENT ANSWER:

*******************************************************************
SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 7 10%: Examine Figure 1 in the assignment handout and
the Data 2 and 3 CCs and Linear Regression formulas of
CSC523f23AudioStructured.ref. Why do you think the DATA 3 tree has a CC
that is within 1.2% of Data 2's value, given the loss of most of the
attributes? How does going from the linear regression formula of
DATA 2 to DATA 3 relate to MDL (Minimum Description Length), given
the fact that the reduction in CC accuracy is much less than my 10%
threshold for MDL?
In [2]: (0.974525-0.963444)/0.974525
Out[2]: 0.01137066776121701
Consulting CSC523f23AudioCCs.ref may also help with this answer.
Make sure to refer to Figure 1 in your answer.

fig1

STUDENT ANSWER:

*******************************************************************

SEE START OF CLASS ZOOM RECORDING OF NOV. 6 TO CLARIFY STUDENT 5, 7, & 8.
STUDENT 8 10%: Look at these frequency domain plots in
https://faculty.kutztown.edu/parson/spring2020/CSC558Audio1_2020.html.

lazy1_SinOsc_1000_0.9_0.0_0.FREQ.png
lazy1_TriOsc_1000_0.9_0.0_0.FREQ.png
lazy1_SqrOsc_1000_0.9_0.0_0.FREQ.png
lazy1_SawOsc_1000_0.9_0.0_0.FREQ.png
lazy1_PulseOsc_1000_0.9_0.0_0.FREQ.png

lazy1_SinOsc_1001_0.500235007566_0.139453694281_615143.FREQ.png
lazy1_TriOsc_1000_0.550165803172_0.187167353289_513021.FREQ.png
lazy1_SqrOsc_1000_0.743147025456_0.162010730076_822957.FREQ.png
lazy1_SawOsc_1001_0.661812948435_0.18926762076_534545.FREQ.png
lazy1_PulseOsc_1000_0.719825313513_0.153077641397_210081.FREQ.png

The first five in this question are reference waveforms with 0.0
tnoign white noise gain, while the last five have tnoign values ranging
from 0.139453694281 to 0.18926762076. Given the difference
in these waveform plots between tnoign=0.0 and tnoign in the range
0.1 to 0.25, why do you think Median and Min are more closely correlated
with tnoign level than Mean, Standard Deviation, or Max?
Put another way, why is CSC523f23AudioCCs.ref ordered the
way it is? Why do Median and Min sit at the top of its CC ordering?

STUDENT ANSWER:

*******************************************************************
STUDENT 9 10%: CSC523f23AudioAssn3.sorted.ref shows the following:
DATA 17 big10Kshuffled CLASSIFIER decisionTreeClassifier
    TRAIN # 5002 TEST # 5003 kappa 1.000000 Correct 5003 %correct 1.000000
DATA 16 big10K CLASSIFIER decisionTreeClassifier
    TRAIN # 5002 TEST # 5003 kappa 0.375562 Correct 3003 %correct 0.600240

Why does DATA 17 big10Kshuffled have a kappa of 1.000000 while
DATA 16 big10K has a kappa of only 0.375562 for classifying toosc
(type of signal oscillator), given the fact that they use the same data?
Your answer must include WHY the difference in these two datasets had
this effect on kappa, and not just what your code did to transform the data.
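As background for interpreting these kappa values (not the answer to the question itself), Cohen's kappa measures agreement corrected for chance agreement. A from-scratch sketch, rather than the sklearn metric the handout presumably uses:

```python
# From-scratch Cohen's kappa: observed agreement corrected for the
# agreement expected by chance. Illustrative only; the handout presumably
# computes kappa via sklearn.
from collections import Counter

def cohen_kappa(actual, predicted):
    n = len(actual)
    po = sum(a == p for a, p in zip(actual, predicted)) / n   # observed
    ca, cp = Counter(actual), Counter(predicted)
    pe = sum(ca[c] * cp[c] for c in ca) / (n * n)             # chance
    return (po - pe) / (1 - pe)

actual    = ['SinOsc', 'SqrOsc', 'SinOsc', 'SqrOsc']
perfect   = list(actual)                        # kappa = 1.0
chancelvl = ['SinOsc', 'SinOsc', 'SqrOsc', 'SqrOsc']  # kappa = 0.0
```

A classifier that is right half the time on a balanced two-class problem scores kappa near 0, which is why kappa is a harsher measure than raw %correct.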

STUDENT ANSWER:

*******************************************************************


Addendum 11/4/2023:

AmplAvg128MDLgraph.jpg
Line graphs of Mean, Median, Min, tnoign, and linear model mdl_tnoign with Max along the X axis.

Spike in Max and Mean is at row 7882.
row[N] Max MeanAmpl Median Min tnoign mdl_tnoign
7870 25042437.3 950297.217 699325.655 126909.654 0.21434862 0.20536397
7871 25655746 892270.739 629969.153 90295.4951 0.19599477 0.18599655
7872 26896844.8 749471.748 433803.632 70306.3422 0.13951522 0.13118066
7873 27172923.5 1048322.34 751396.245 91406.4197 0.20567984 0.21993286
7874 27498089.1 798744.191 492841.797 55728.3224 0.1287018 0.14768741
7875 27548207.5 600249.823 371114.463 49482.4463 0.10064269 0.11366952
7876 27604606.9 896487.674 645538.965 28632.0212 0.18691958 0.19037592
7877 28734832.7 1011017.89 753981.969 90616.2651 0.2383981 0.22065588
7878 28869365.8 1015829.92 763037.123 45470.2699 0.21298605 0.22320705
7879 34374944.0 965978.908 658009.179 87987.6064 0.1966708 0.19383431
7880 36893096.0 1079560.63 719765.747 120632.259 0.22875524 0.21107946
7881 40411706.0 1145769.33 854662.656 24356.1706 0.23802755 0.24882437
7882 233107488.0 2624774.41 765606.969 40333.6345 0.23489374 0.2239276
7883 233609256.0 2376112.38 515508.328 21179.7309 0.14997846 0.15403794
7884 245015009.0 2665982.25 675460.05 128038.695 0.19428094 0.19869343
7885 249942747.0 2699865.79 699967.241 107830.465 0.2079127 0.2055519
7886 250068341.0 2536287.11 541088.737 45558.9197 0.15809185 0.16117621
7887 251239967.0 2476912.57 450026.645 11679.147 0.15200352 0.13574122
7888 255681455.0 2857093.87 771225.557 48129.5301 0.2384701 0.22549438
7889 256187276.0 2847472.89 770235.874 84439.4073 0.23796245 0.22520136
7890 256839903.0 2878113.51 811501.204 58980.1562 0.22540067 0.23674583

There is no obvious reason for the spike in the toscgn and tnoign parameters of the corresponding .wav files:

 
7870   Path2Waves + "lazy1_SqrOsc_883_0.5019196902857631_0.13410920677627836_774888.wav",
 7871   Path2Waves + "lazy1_SqrOsc_885_0.5040248778513396_0.23435531677370172_451831.wav",
 7872   Path2Waves + "lazy1_SqrOsc_886_0.6761024165181524_0.20043111620802856_524410.wav",
 7873   Path2Waves + "lazy1_SqrOsc_887_0.6174319802004277_0.12557687468457596_950968.wav",
 7874   Path2Waves + "lazy1_SqrOsc_888_0.5238622974154334_0.15147334994900363_804612.wav",
 7875   Path2Waves + "lazy1_SqrOsc_888_0.6418546582919902_0.19084791411015625_634933.wav",
 7876   Path2Waves + "lazy1_SqrOsc_889_0.5362953028226431_0.1762284617042985_504835.wav",
 7877   Path2Waves + "lazy1_SqrOsc_892_0.5235271521310194_0.2338311279069861_636038.wav",
 7878   Path2Waves + "lazy1_SqrOsc_892_0.6646530432416745_0.22680253281084703_328132.wav",
 7879   Path2Waves + "lazy1_SqrOsc_892_0.7217593009337524_0.19029282399381536_818824.wav",
 7880   Path2Waves + "lazy1_SqrOsc_892_0.7293327302121537_0.14308256163151722_400709.wav",
 7881   Path2Waves + "lazy1_SqrOsc_893_0.6358610041406884_0.11435598256117246_617656.wav",
 7882   Path2Waves + "lazy1_SqrOsc_899_0.566973563640619_0.23048518514347255_444139.wav",
 7883   Path2Waves + "lazy1_SqrOsc_900_0.6853473203356168_0.10706930783607058_460924.wav",
 7884   Path2Waves + "lazy1_SqrOsc_901_0.6473665158447626_0.16574463090753014_590583.wav",
 7885   Path2Waves + "lazy1_SqrOsc_901_0.7265478986563011_0.23435666926403104_320089.wav",
 7886   Path2Waves + "lazy1_SqrOsc_902_0.6770973417562448_0.23089770748081634_526051.wav",
 7887   Path2Waves + "lazy1_SqrOsc_902_0.7348565888313343_0.19430357642905632_822145.wav",
 7888   Path2Waves + "lazy1_SqrOsc_904_0.7160828135895405_0.17943542725340125_945078.wav",
 7889   Path2Waves + "lazy1_SqrOsc_905_0.7067836319127334_0.1194957976001336_402210.wav",
 7890   Path2Waves + "lazy1_SqrOsc_906_0.6227309699714663_0.24224402408964366_375083.wav",

Pseudo-random signal levels from https://faculty.kutztown.edu/parson/spring2020/genwav.py.txt

These calls to random.randint and random.uniform:

            freq = random.randint(100,2000)
            gainosc = random.uniform(.5, .75)
            gainnoise = random.uniform(.1, .25)

are giving a non-uniform, monotonically increasing distribution of frequencies for the last 833 SqrOsc entries below.
The signal and white noise gains from random.uniform appear to be fluctuating uniformly.

  1168  2000_0.7316099204127263_0.1345648623275206
  1169  200_0.575054908577552_0.16707096720619902
  1170  200_0.5780057821133431_0.16775737166844037
  1171  200_0.6911696008534549_0.10813990571092388
  1172  201_0.7283323250127087_0.12307749549715327
...
  1876  892_0.7293327302121537_0.14308256163151722
  1877  893_0.6358610041406884_0.11435598256117246
  1878  899_0.566973563640619_0.23048518514347255
  1879  900_0.6853473203356168_0.10706930783607058
  1880  901_0.6473665158447626_0.16574463090753014
...
  1997  997_0.6114215185047296_0.17181569951464715
  1998  998_0.5794539349004512_0.15945904197023156
  1999  998_0.6929647312679408_0.15457440558587582
  2000  998_0.7197724877009135_0.22186728937530703
  2001  999_0.5624164158552742_0.22045281358309338

Correlation coefficient for mdl_tnoign to tnoign confirmed in AmplAvg128MDLgraph.xlsx.

A high CC does not mean the actual and predicted values are EQUAL:

In [1]: from scipy.stats import pearsonr                                       

In [2]: from math import sqrt                                                  

In [9]: actual = [float(i) for i in range(1, 10002, 100)]                      

In [10]: len(actual)                                                           
Out[10]: 101

In [11]: predicted = [value * 100 for value in actual]                         

In [12]: CC = pearsonr(actual, predicted)                                      

In [13]: CC                                                                    
Out[13]: (0.9999999999999998, 0.0)

In [15]: from statistics  import mean                                          

In [16]: mabse = mean([abs(actual[ix]-predicted[ix]) for ix in range(0,len(actual))])                   

In [17]: mabse                                                                                          
Out[17]: 495099.0

In [18]: predicted[50]                                                                                  
Out[18]: 500100.0

In [19]: actual[50]                                                                                     
Out[19]: 5001.0

In [20]: predicted[50]-actual[50]                                                                       
Out[20]: 495099.0

In [21]: from math import sqrt                                                                          

In [23]: rmsqe = sqrt(mean([((actual[ix]-predicted[ix])**2) for ix in range(0,len(actual))]))           

In [24]: rmsqe                                                                                          
Out[24]: 573089.4518319108

Parallel Curves in Value Variation == High Correlation Coefficient

MedianMinTnoign.jpg

Median, Min, and tnoign for first 100 data rows of AmplAvg128.

lazy1_SqrOsc_1000_0.9_0.0_0.overlay.jpg


lazy1_SqrOsc_1000_0.743147025456_0.162010730076_822957_overlay.jpg