CSC 523 - Scripting for Data Science, Fall 2022, Assignment 2 (official).

Assignment 2 is due 11:59 PM Thursday October 13 via make turnitin. You must test on mcgonagall.
Oct 10 is a KU holiday, and I will be on vacation the 11th and 13th. I will hold Zoom office hours on Wednesday the 12th.
The 11th is our scheduled class. I will post a Zoom video ahead of time.


1. To get the assignment on mcgonagall (ssh mcgonagall or ssh kupapcsit01 from acad):
    mkdir ~/DataMine   #  (This is in your home directory; it may already exist if you took one of my data science courses.)
    cd ~/DataMine
    cp ~parson/DataMine/CSC523HMfall2022.problem.zip CSC523HMfall2022.problem.zip
    unzip CSC523HMfall2022.problem.zip
    cd ./CSC523HMfall2022
    make clean test        # This fails on handout code as follows:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523HMfall2022Trace.txt CSC523HMfall2022Out.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
/bin/rm -f CSC523HMfall2022Trace.txt CSC523HMfall2022Out.txt
/usr/local/bin/python3.7 CSC523HMfall2022_main.py CSC523HMfall2022Trace.txt CSC523HMfall2022_generator day_aggregate_HMS_1976_2021_kupapcsit01.arff '' > CSC523HMfall2022Out.txt
diff --ignore-trailing-space --strip-trailing-cr CSC523HMfall2022Out.txt CSC523HMfall2022Out.txt.ref > CSC523HMfall2022Out.txt.dif
make: *** [test] Error 1

The handout code itself runs correctly; the test fails only because it is missing models and data that you have to add.
CSC523HMfall2022Out.txt.dif looks like this:

21a22,117
> *******************************************************
> DATA 8 y76dAug1TOtempcTestEqTrain REGRESSOR BigRandomForestRegressorMAE TRAIN # 5218 TEST # 5218 CORR_COEF 0.981508 RMSQERROR 1.632223 MABSERROR 1.195834
> *******************************************************
> *******************************************************
> DATA 9 y76dAug1TOtempcTestEqTrain REGRESSOR ShallowRandomForestRegressorMAE TRAIN # 5218 TEST # 5218 CORR_COEF 0.866267 RMSQERROR 4.206889 MABSERROR 3.312003
> *******************************************************
> *******************************************************
> DATA 10 y76dAug1TOtempcTestEqTrain REGRESSOR RidgeRegressor TRAIN # 5218 TEST # 5218 CORR_COEF 0.864068 RMSQERROR 4.232113 MABSERROR 3.354357
> *******************************************************
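
Each DATA line above reports three metrics: CORR_COEF, RMSQERROR and MABSERROR. The sketch below shows one way such values could be computed. It assumes they are the Pearson correlation coefficient, the root-mean-squared error and the mean absolute error between actual and predicted target values; the handout code may compute them differently. The function name regression_metrics and the sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Return (corr_coef, rms_error, mean_abs_error) for two equal-length lists.

    Assumption: these correspond to CORR_COEF, RMSQERROR and MABSERROR
    in the assignment output; the handout code is the authority.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Pearson correlation: covariance over the product of standard deviations.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)
    # Root-mean-squared error penalizes large misses more heavily.
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    # Mean absolute error weights every miss linearly.
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return corr, rmse, mae

# Hypothetical values: every prediction is exactly 1.0 too high.
corr, rmse, mae = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(corr, rmse, mae)
```

Note that a constant offset leaves the correlation at 1.0 while both error measures are nonzero, which is why the output reports all three.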

There are 96 lines of missing output at 3 lines per model, which comes to 32 missing models (DATA 8 through DATA 39):

$ grep DATA CSC523HMfall2022Out.txt CSC523HMfall2022Out.txt.ref | cut -d" " -f1-2
CSC523HMfall2022Out.txt:DATA 1
CSC523HMfall2022Out.txt:DATA 2
CSC523HMfall2022Out.txt:DATA 3
CSC523HMfall2022Out.txt:DATA 4
CSC523HMfall2022Out.txt:DATA 5
CSC523HMfall2022Out.txt:DATA 6
CSC523HMfall2022Out.txt:DATA 7
CSC523HMfall2022Out.txt.ref:DATA 1
CSC523HMfall2022Out.txt.ref:DATA 2
CSC523HMfall2022Out.txt.ref:DATA 3
CSC523HMfall2022Out.txt.ref:DATA 4
CSC523HMfall2022Out.txt.ref:DATA 5
CSC523HMfall2022Out.txt.ref:DATA 6
CSC523HMfall2022Out.txt.ref:DATA 7
CSC523HMfall2022Out.txt.ref:DATA 8
CSC523HMfall2022Out.txt.ref:DATA 9
CSC523HMfall2022Out.txt.ref:DATA 10
CSC523HMfall2022Out.txt.ref:DATA 11
CSC523HMfall2022Out.txt.ref:DATA 12
CSC523HMfall2022Out.txt.ref:DATA 13
CSC523HMfall2022Out.txt.ref:DATA 14
CSC523HMfall2022Out.txt.ref:DATA 15
CSC523HMfall2022Out.txt.ref:DATA 16
CSC523HMfall2022Out.txt.ref:DATA 17
CSC523HMfall2022Out.txt.ref:DATA 18
CSC523HMfall2022Out.txt.ref:DATA 19
CSC523HMfall2022Out.txt.ref:DATA 20
CSC523HMfall2022Out.txt.ref:DATA 21
CSC523HMfall2022Out.txt.ref:DATA 22
CSC523HMfall2022Out.txt.ref:DATA 23
CSC523HMfall2022Out.txt.ref:DATA 24
CSC523HMfall2022Out.txt.ref:DATA 25
CSC523HMfall2022Out.txt.ref:DATA 26
CSC523HMfall2022Out.txt.ref:DATA 27
CSC523HMfall2022Out.txt.ref:DATA 28
CSC523HMfall2022Out.txt.ref:DATA 29
CSC523HMfall2022Out.txt.ref:DATA 30
CSC523HMfall2022Out.txt.ref:DATA 31
CSC523HMfall2022Out.txt.ref:DATA 32
CSC523HMfall2022Out.txt.ref:DATA 33
CSC523HMfall2022Out.txt.ref:DATA 34
CSC523HMfall2022Out.txt.ref:DATA 35
CSC523HMfall2022Out.txt.ref:DATA 36
CSC523HMfall2022Out.txt.ref:DATA 37
CSC523HMfall2022Out.txt.ref:DATA 38
CSC523HMfall2022Out.txt.ref:DATA 39
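
The count of missing lines and models can be checked from the diff header 21a22,117 shown earlier. This is a minimal sketch, assuming the normal diff "add" header format (NaM,N) and 3 output lines per model (a row of asterisks, the DATA line, another row of asterisks). The function name added_line_count is hypothetical.

```python
import re

def added_line_count(header):
    """Count the lines added by a normal-format diff header like '21a22,117'."""
    match = re.fullmatch(r"(\d+)a(\d+)(?:,(\d+))?", header)
    if match is None:
        raise ValueError("not a diff 'add' header: " + header)
    first = int(match.group(2))
    # A header such as '5a6' adds a single line, so last defaults to first.
    last = int(match.group(3)) if match.group(3) else first
    return last - first + 1

LINES_PER_MODEL = 3  # asterisks, DATA line, asterisks

missing_lines = added_line_count("21a22,117")
missing_models = missing_lines // LINES_PER_MODEL
print(missing_lines, missing_models)  # prints: 96 32
```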

2. Edit CSC523HMfall2022_generator.py and search for upper-case STUDENT comments. The requirements with % point values are there. There is also a file README.txt in which you must answer questions. The code is worth 70% of the assignment and README.txt the remaining 30%.

    See the course page on Notepad++ if you are new to our Linux systems, or you can log in and use the vim or emacs editor.
    Scroll down to Basic UNIX Information on this page.

3. We will go over the handout code in class on September 26. When you are ready to test your code, type make clean test in the code directory. A successful test run looks like this:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523HMfall2022Trace.txt CSC523HMfall2022Out.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
/bin/rm -f CSC523HMfall2022Trace.txt CSC523HMfall2022Out.txt
/usr/local/bin/python3.7 CSC523HMfall2022_main.py CSC523HMfall2022Trace.txt CSC523HMfall2022_generator day_aggregate_HMS_1976_2021_kupapcsit01.arff '' > CSC523HMfall2022Out.txt
diff --ignore-trailing-space --strip-trailing-cr CSC523HMfall2022Out.txt CSC523HMfall2022Out.txt.ref > CSC523HMfall2022Out.txt.dif
sed -e 's/^TIME[^D]*DATA/DATA/' < CSC523HMfall2022Trace.txt > CSC523HMfall2022Trace.tmp
diff --ignore-trailing-space --strip-trailing-cr CSC523HMfall2022Trace.tmp CSC523HMfall2022Trace.txt.ref > CSC523HMfall2022Trace.txt.dif
grep DATA CSC523HMfall2022Out.txt.ref | sort -rn -t' ' -k13 | sed -e 's/^DATA/\nDATA/' > CSC523HMfall2022Out.sorted.txt.ref
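
Two of the makefile commands above are easier to follow with a small Python equivalent. The sed command strips a leading TIME prefix from trace lines before the diff, presumably because run times vary between runs. The sort command orders DATA lines by space-separated field 13, which is the CORR_COEF value, highest first. This is a rough sketch of both, not the makefile's implementation. The function names and the sample TIME line are hypothetical, and the order of tied values may differ from GNU sort.

```python
import re

def strip_time_prefix(line):
    """Mimic sed -e 's/^TIME[^D]*DATA/DATA/' on one trace line."""
    return re.sub(r"^TIME[^D]*DATA", "DATA", line)

def corr_coef(line):
    """Return space-separated field 13 (index 12), the CORR_COEF value."""
    return float(line.split(" ")[12])

def sort_by_corr_coef(lines):
    """Mimic sort -rn -t' ' -k13: highest correlation coefficient first."""
    return sorted(lines, key=corr_coef, reverse=True)

# Hypothetical trace line; the real trace format may differ.
print(strip_time_prefix("TIME 0.42 sec DATA 8 y76dAug1TOtempcTestEqTrain"))

# DATA lines copied from the reference output, deliberately out of order.
ref_lines = [
    "DATA 10 y76dAug1TOtempcTestEqTrain REGRESSOR RidgeRegressor TRAIN # 5218 TEST # 5218 CORR_COEF 0.864068 RMSQERROR 4.232113 MABSERROR 3.354357",
    "DATA 8 y76dAug1TOtempcTestEqTrain REGRESSOR BigRandomForestRegressorMAE TRAIN # 5218 TEST # 5218 CORR_COEF 0.981508 RMSQERROR 1.632223 MABSERROR 1.195834",
    "DATA 9 y76dAug1TOtempcTestEqTrain REGRESSOR ShallowRandomForestRegressorMAE TRAIN # 5218 TEST # 5218 CORR_COEF 0.866267 RMSQERROR 4.206889 MABSERROR 3.312003",
]
for line in sort_by_corr_coef(ref_lines):
    print(line.split(" ")[1], corr_coef(line))
```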

4. Make sure to answer all questions in README.txt. You can do this before your code is working.

5. After make clean test works without errors (terminates without an error message) and README.txt is answered, type make turnitin and hit Enter at the prompt to get your work to me before the deadline.

If you make subsequent changes and make clean test still passes, you can run make turnitin again to overwrite your previous submission. Note that this is not the "turnin" script you may have used in other courses.

There is a 10% per day penalty for late assignments in my courses, and I cannot grant any points after I go over a solution.

$ make turnitin
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523HMfall2022Trace.txt CSC523HMfall2022Out.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*

Do you really want to send CSC523HMfall2022 to Professor Parson?
Hit Enter to continue, control-C to abort.


/bin/bash -c "cd .. ; /bin/chmod 700 .                  ; \
        /bin/tar cvf ./CSC523HMfall2022_parson.tar CSC523HMfall2022      ; \
        /bin/gzip ./CSC523HMfall2022_parson.tar                    ; \
        /bin/chmod 666 ./CSC523HMfall2022_parson.tar.gz            ; \
        /bin/mv ./CSC523HMfall2022_parson.tar.gz ~parson/incoming"
CSC523HMfall2022/
CSC523HMfall2022/__pycache__/
CSC523HMfall2022/makelib
CSC523HMfall2022/arfflib_3_2.py
CSC523HMfall2022/__init__.py
CSC523HMfall2022/fakegen.py
CSC523HMfall2022/CSC523HMfall2022Out.txt.ref
CSC523HMfall2022/CSC523HMfall2022Out.sorted.txt.ref
CSC523HMfall2022/CSC523HMfall2022Trace.txt.ref
CSC523HMfall2022/day_aggregate_HMS_1976_2021_kupapcsit01.arff
CSC523HMfall2022/plotcsv_1_2.py
CSC523HMfall2022/diffarff.py
CSC523HMfall2022/CSC523HMfall2022_main.py
CSC523HMfall2022/CSC523HMfall2022_generator.py
CSC523HMfall2022/makefile
CSC523HMfall2022/README.txt