CSC 523 - Scripting for Data Science, Fall 2022, Assignment 3 (official).

Assignment 3 due 11:59 PM Friday October 28 via make turnitin. You must test on mcgonagall.

Added Figures 11 & 12 on 10/23/2022; we reviewed them in the previous Monday's class.

1. To get the assignment on mcgonagall (ssh mcgonagall or ssh kupapcsit01 from acad):
    mkdir ~/DataMine   #  (This is in your home directory; it may already exist if you took one of my data science courses.)
    cd ~/DataMine
    cp ~parson/DataMine/CSC523Fall2022Classify.problem.zip CSC523Fall2022Classify.problem.zip
    unzip CSC523Fall2022Classify.problem.zip
    cd ./CSC523Fall2022Classify
    make clean test
   
    # make clean test fails on the unmodified handout code because its output diffs against the
    # reference files until you add the missing data derivations and models.
    # The handout code itself is working code.


These are the attributes we will use in this analysis in month_aggregate_HMS_goodyears.arff.gz,
which is a monthly aggregate of daily aggregates of (mostly 1-hour) observation periods.

year                1976-2021
month               8-12
HMtempC_mean        mean for month of temp Celsius during observation times
WindSpd_mean        same for wind speed in km/hour
HMtempC_median      median for month
WindSpd_median
HMtempC_pstdv       population standard deviation
WindSpd_pstdv
HMtempC_min         minimum & maximum
WindSpd_min
HMtempC_max
WindSpd_max
wndN                tally of North winds for all observations in the month, etc.
wndNNE
wndNE
wndENE
wndE
wndESE
wndSE
wndSSE
wndS
wndSSW
wndSW
wndWSW
wndW
wndWNW
wndNW
wndNNW
wndUNK
HMtempC_24_mean     Changes in magnitude (absolute value of change) over 24, 48, and 72 hours
HMtempC_48_mean
HMtempC_72_mean
HMtempC_24_median
HMtempC_48_median
HMtempC_72_median
HMtempC_24_pstdv
HMtempC_48_pstdv
HMtempC_72_pstdv
HMtempC_24_min      The min & max are their signed values.
HMtempC_48_min
HMtempC_72_min
HMtempC_24_max
HMtempC_48_max
HMtempC_72_max
SS_All              Tally of sharp-shinned hawk observations during each month 8-12, 1976-2021.
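
The attribute suffixes above (_mean, _median, _pstdv, _min, _max) can be illustrated with Python's standard statistics module. This is only a sketch on made-up numbers: the function names aggregate and change_aggregate are hypothetical, and the way the real dataset was aggregated from daily observations is not shown in this handout. The second function follows my reading of the notes on the 24/48/72-hour attributes: mean, median, and pstdv summarize magnitudes (absolute values), while min and max keep the sign.

```python
import statistics


def aggregate(values):
    """Summarize one month of values with the five aggregates in the list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "pstdv": statistics.pstdev(values),  # population (not sample) std dev
        "min": min(values),
        "max": max(values),
    }


def change_aggregate(signed_changes):
    """Summarize temperature changes: magnitudes for the central and spread
    measures, signed values for min and max (an assumption from the notes)."""
    magnitudes = [abs(c) for c in signed_changes]
    return {
        "mean": statistics.mean(magnitudes),
        "median": statistics.median(magnitudes),
        "pstdv": statistics.pstdev(magnitudes),
        "min": min(signed_changes),  # most negative change, sign kept
        "max": max(signed_changes),  # most positive change, sign kept
    }


# Made-up sample data, not taken from the assignment's ARFF file.
temps_c = [10.0, 12.0, 14.0, 16.0]
changes_24h = [-3.0, 1.0, 2.0]

temp_stats = aggregate(temps_c)
change_stats = change_aggregate(changes_24h)
print(temp_stats)
print(change_stats)
```

Note that statistics.pstdev divides by N, while statistics.stdev divides by N-1; the _pstdv attributes are described as the population version.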

2. Edit CSC523Fall2022Classify_generator.py and search for upper-case STUDENT comments. The requirements with % point values are there. There is also a file README.txt in which you must answer questions. The code is worth 30% of the assignment and README.txt the remaining 70%. You can work on README.txt before completing the code if you want, since its questions rely on the reference files of expected output.

MAKE SURE to search for all upper-case STUDENT comments, especially ones starting with "# STUDENTS MUST uncomment this code". Some working code is commented out until you define derived data views and models (regressors & classifiers) bound to the specified Python variables.
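
If you want a checklist of every STUDENT comment rather than searching in an editor, a few lines of Python will list them. This is a convenience sketch only; the helper name find_student_comments is hypothetical and is not part of the handout code.

```python
from pathlib import Path


def find_student_comments(path):
    """Return (line_number, text) pairs for lines containing upper-case STUDENT."""
    hits = []
    lines = Path(path).read_text().splitlines()
    for number, line in enumerate(lines, start=1):
        if "STUDENT" in line:  # case-sensitive on purpose
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    target = Path("CSC523Fall2022Classify_generator.py")
    if target.exists():  # only runs inside the assignment directory
        for number, text in find_student_comments(target):
            print(f"{number}: {text}")
```

Lines that match "STUDENTS MUST uncomment this code" are included because they contain STUDENT.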

Here are Figures relating to questions in README.txt.

F1

Figure 1: Initial distribution of numeric target attribute SS_All (NumericTarget)

F2

Figure 2: Distribution of Log10Target (targetdataLog10 STUDENT 1), log10(SS_All) values
Take 10^Maximum to get the max of Figure 1.

F3

Figure 3: Distribution of SqrtTarget (targetdataSqrt), sqrt(SS_All) values
Take Maximum^2 to get the max of Figure 1.
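
The two transforms in Figures 2 and 3 and the way to undo them can be checked on a few numbers. This is a sketch with made-up, strictly positive counts; the function names are hypothetical, and how the assignment code treats a count of 0 (where log10 is undefined) is not covered here.

```python
import math


def log10_transform(counts):
    """Compress a right-skewed list of positive counts with log base 10."""
    return [math.log10(c) for c in counts]


def sqrt_transform(counts):
    """Compress the same counts more gently with a square root."""
    return [math.sqrt(c) for c in counts]


counts = [1, 10, 100, 10000]  # made-up values, not from the dataset

log_values = log10_transform(counts)
sqrt_values = sqrt_transform(counts)

# Undo each transform on its maximum to recover the original maximum.
recovered_from_log = 10 ** max(log_values)
recovered_from_sqrt = max(sqrt_values) ** 2

print(max(counts), recovered_from_log, recovered_from_sqrt)
```

Both transforms preserve order, so the largest transformed value always corresponds to the largest original count.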

F4

Figure 4: Equal-value-range discretization distribution of SS_All in targetdataNomRange

F5

Figure 5: Equal-bin-frequency discretization distribution of SS_All in targetdataNomEqFreq (STUDENT 2)
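
The difference between the discretizations in Figures 4 and 5 can be seen on a small skewed sample. This is a minimal sketch, not the code the assignment expects: the function names and the bin count of 4 are assumptions, and the rank-based equal-frequency version below may place tied values in different bins.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    labels = []
    for v in values:
        index = int((v - low) / width) if width > 0 else 0
        labels.append(min(index, n_bins - 1))  # the maximum goes in the last bin
    return labels


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


skewed = [1, 2, 3, 4, 5, 6, 7, 1000]  # made-up, one large outlier

width_labels = equal_width_bins(skewed, 4)
freq_labels = equal_frequency_bins(skewed, 4)

print("equal width:    ", sorted(Counter(width_labels).items()))
print("equal frequency:", sorted(Counter(freq_labels).items()))
```

With a skewed target, equal-width bins leave most instances in one bin and some bins empty, while equal-frequency bins stay balanced at the cost of unequal value ranges.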

F6
Figure 6: Y=Log10Target of Figure 2 as a function of X=NumericTarget in Figure 1.

The following figures were added after the handout was distributed late morning of 10/12 to help with README.txt.

F7

Figure 7: M5P model of SS_All, original counts by year, month
7 Linear Expressions at the Leaves
Correlation coefficient                  0.9076
Mean absolute error                    407.9356
Root mean squared error                736.9916
Relative absolute error                 31.0553 %
Root relative squared error             42.2449 %
Total Number of Instances              226

F8

Figure 8: Same Data as Figure 7, Final Years

F9

Figure 9: M5P model of SS_All_Log10, original counts by year, month
8 Linear Expressions at the Leaves
Correlation coefficient                  0.965
Mean absolute error                      0.2364
Root mean squared error                  0.3031
Relative absolute error                 23.9819 %
Root relative squared error             26.4643 %
Total Number of Instances              226
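
The five error measures listed under Figures 7 and 9 can be computed by hand from actual and predicted values. The sketch below uses the common textbook definitions on made-up numbers; the function name is hypothetical, and the tool that produced the figures may differ in detail (for example, in which mean the relative measures use as their baseline).

```python
import math


def regression_metrics(actual, predicted):
    """Compute correlation, MAE, RMSE, and the two relative errors in percent."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]

    # Baseline for the relative measures: always predict the mean of actual.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    spread_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": covariance / math.sqrt(base_sq * spread_p),
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rae_percent": 100 * sum(abs_err) / base_abs,
        "rrse_percent": 100 * math.sqrt(sum(sq_err) / base_sq),
    }


# Made-up values, not from the assignment's models.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 6.0, 7.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name:14s} {value:.4f}")
```

Mean absolute error and root mean squared error are in the units of the target, so their values for SS_All (counts) and SS_All_Log10 (log10 of counts) are not directly comparable. The correlation coefficient and the two relative errors are unit-free.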

F10

Figure 10: Same Data as Figure 9, Final Years

F11

Figure 11: SS_All observations as a function of discrete month 8 through 12.

F12

Figure 12: SS_All_Log10 observations as a function of discrete month 8 through 12.

3. We will go over this handout code in class on October 17. When you are ready to test your code, type make clean test in the code directory. A successful test run looks like this:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
/bin/rm -f CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/usr/local/bin/python3.7 CSC523Fall2022Classify_main.py CSC523Fall2022ClassifyTrace.txt CSC523Fall2022Classify_generator month_aggregate_HMS_goodyears.arff '' > CSC523Fall2022ClassifyOut.txt
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022ClassifyOut.txt CSC523Fall2022ClassifyOut.txt.ref > CSC523Fall2022ClassifyOut.txt.dif
sed -e 's/^TIME[^D]*DATA/DATA/' < CSC523Fall2022ClassifyTrace.txt > CSC523Fall2022ClassifyTrace.tmp
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022ClassifyTrace.tmp CSC523Fall2022ClassifyTrace.txt.ref > CSC523Fall2022ClassifyTrace.txt.dif
grep DATA CSC523Fall2022ClassifyOut.txt.ref | sort -rn -t' ' -k13 | sed -e 's/^DATA/\nDATA/' | grep REGRESSOR > CSC523Fall2022ClassifyOut.sorted.txt.ref
echo "" >> CSC523Fall2022ClassifyOut.sorted.txt.ref
grep DATA CSC523Fall2022ClassifyOut.txt.ref | sort -rn -t' ' -k17 | sed -e 's/^DATA/\nDATA/' | grep CLASSIFIER >> CSC523Fall2022ClassifyOut.sorted.txt.ref
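
The sed and diff steps in the output above can be mimicked in Python to see why timing information does not cause test failures. This is a rough sketch, not a replacement for make clean test: the function names and the sample trace line are made up, and difflib's output format differs from that of diff.

```python
import difflib
import re

# Same pattern as the sed command: a leading TIME..., up to the first DATA.
TIME_PREFIX = re.compile(r"^TIME[^D]*DATA")


def normalize(lines):
    """Drop trailing whitespace and carriage returns, then strip the TIME prefix."""
    return [TIME_PREFIX.sub("DATA", line.rstrip()) for line in lines]


def diff_report(actual_lines, reference_lines):
    """Return unified-diff lines; an empty list means the files match."""
    return list(
        difflib.unified_diff(
            normalize(reference_lines),
            normalize(actual_lines),
            fromfile="reference",
            tofile="actual",
            lineterm="",
        )
    )


# Hypothetical trace content; the real trace format is defined by the handout code.
trace = ["TIME 0.12 sec DATA x=1  \r\n", "other line\n"]
reference = ["DATA x=1\n", "other line\n"]

matching_report = diff_report(trace, reference)
changed_report = diff_report(["DATA x=2\n", "other line\n"], reference)
print(len(matching_report), "diff lines for matching run")
print("\n".join(changed_report))
```

An empty report here corresponds to an empty .dif file from the diff commands above.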

4. Make sure to answer all questions in README.txt. You can do this before your coding is working.

5. After make clean test works without errors (terminates without an error message) and README.txt is answered, type make turnitin and hit Enter at the prompt to get your work to me before the deadline.

If you make subsequent changes and make clean test still passes, you can run make turnitin again and overwrite your previous submission. Note that this is not the "turnin" script you may have used in other courses.

There is a 10% per day penalty for late assignments in my courses and I cannot grant any points after I go over a solution.

$ make turnitin
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*

Do you really want to send CSC523Fall2022Classify to Professor Parson?
Hit Enter to continue, control-C to abort.


/bin/bash -c "cd .. ; /bin/chmod 700 .                  ; \
        /bin/tar cvf ./CSC523Fall2022Classify_parson.tar CSC523Fall2022Classify      ; \
        /bin/gzip ./CSC523Fall2022Classify_parson.tar                    ; \
        /bin/chmod 666 ./CSC523Fall2022Classify_parson.tar.gz            ; \
        /bin/mv ./CSC523Fall2022Classify_parson.tar.gz ~parson/incoming"
CSC523Fall2022Classify/
CSC523Fall2022Classify/__pycache__/
CSC523Fall2022Classify/makelib
CSC523Fall2022Classify/diffarff.py
CSC523Fall2022Classify/plotcsv_1_2.py
CSC523Fall2022Classify/arfflib_3_3.py
CSC523Fall2022Classify/__init__.py
CSC523Fall2022Classify/month_aggregate_HMS_goodyears.arff
CSC523Fall2022Classify/CSC523Fall2022Classify_main.py
CSC523Fall2022Classify/CSC523Fall2022ClassifyOut.txt.ref
CSC523Fall2022Classify/CSC523Fall2022ClassifyTrace.txt.ref
CSC523Fall2022Classify/CSC523Fall2022ClassifyOut.sorted.txt.ref
CSC523Fall2022Classify/CSC523Fall2022Classify_generator.py
CSC523Fall2022Classify/F2022Assn3Keepers.arff.gz
CSC523Fall2022Classify/makefile
CSC523Fall2022Classify/README.txt