CSC 523 - Scripting for Data Science, Fall 2022, Assignment 3 (official).
Assignment 3 is due by 11:59 PM on Friday, October 28 via make
turnitin. You must test on mcgonagall.
Figures 11 & 12 were added on 10/23/2022; we reviewed them in
the previous Monday's class.
1. To get the assignment on mcgonagall (ssh
mcgonagall or ssh kupapcsit01 from acad):
mkdir ~/DataMine    # This is in your home directory; it may already exist if you took one of my data science courses.
cd ~/DataMine
cp ~parson/DataMine/CSC523Fall2022Classify.problem.zip CSC523Fall2022Classify.problem.zip
unzip CSC523Fall2022Classify.problem.zip
cd ./CSC523Fall2022Classify
make clean test
# This fails on the handout code because the test diffs detect
# the missing data derivations and models.
# Otherwise the handout code is working code.
These are the attributes we will use in this analysis in
month_aggregate_HMS_goodyears.arff.gz,
which is a monthly aggregate of daily aggregates of (mostly
1-hour) observation periods.
year                1976-2021
month               8-12
HMtempC_mean        mean for month of temp Celsius during observation times
WindSpd_mean        same for wind speed in km/hour
HMtempC_median      median for month
WindSpd_median
HMtempC_pstdv       population standard deviation
WindSpd_pstdv
HMtempC_min         minimum & maximum
WindSpd_min
HMtempC_max
WindSpd_max
wndN                tally of North winds for all observations in the month, etc.
wndNNE
wndNE
wndENE
wndE
wndESE
wndSE
wndSSE
wndS
wndSSW
wndSW
wndWSW
wndW
wndWNW
wndNW
wndNNW
wndUNK
HMtempC_24_mean     Changes in magnitude (absolute value of change) over 24, 48, and 72 hours
HMtempC_48_mean
HMtempC_72_mean
HMtempC_24_median
HMtempC_48_median
HMtempC_72_median
HMtempC_24_pstdv
HMtempC_48_pstdv
HMtempC_72_pstdv
HMtempC_24_min      The min & max are their signed values.
HMtempC_48_min
HMtempC_72_min
HMtempC_24_max
HMtempC_48_max
HMtempC_72_max
SS_All              Tally of sharp-shinned hawk observations during each month 8-12, 1976-2021.
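As a rough illustration of how the _mean, _median, _pstdv, _min and _max suffixes above correspond to standard summary statistics, here is a minimal sketch using Python's statistics module. The readings and the function name summarize are made up for illustration; this is not the code that produced the data file.

```python
import statistics

def summarize(prefix, values):
    """Return {attribute_name: value} using the suffixes from the attribute list."""
    return {
        prefix + "_mean": statistics.mean(values),
        prefix + "_median": statistics.median(values),
        # pstdev is the POPULATION standard deviation (divides by n, not n-1).
        prefix + "_pstdv": statistics.pstdev(values),
        prefix + "_min": min(values),
        prefix + "_max": max(values),
    }

# Hypothetical temperature readings (Celsius) for one month.
temps = [10.0, 12.0, 14.0, 16.0]
for name, value in summarize("HMtempC", temps).items():
    print(name, round(value, 4))
```

Note that statistics.stdev (the sample standard deviation) would give a slightly larger value than statistics.pstdev on the same readings.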
2. Edit CSC523Fall2022Classify_generator.py
and search for upper-case STUDENT comments. The
requirements with % point values are there. There is also a
file README.txt in which you must answer questions.
The code is worth 30% of the assignment and README.txt the
remaining 70%. You can work on the README.txt before
completing the code if you want since it uses the Reference
files of expected output.
MAKE SURE to search for all upper-case STUDENT comments,
especially ones starting with "# STUDENTS MUST uncomment this
code". Some working code is commented out until you define
derived data views and models (regressors & classifiers)
bound to the specified Python variables.
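If you want a quick checklist of the STUDENT comments, a small helper along these lines may help. This is only a sketch: the function name find_student_comments and the sample text are hypothetical, and the sample lines are not copied from the handout code.

```python
import re

def find_student_comments(source_text):
    """Return (line_number, text) pairs for comment lines containing upper-case STUDENT."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        # Match a '#' comment that mentions STUDENT (this also matches STUDENTS).
        if re.search(r"#.*STUDENT", line):
            hits.append((number, line.strip()))
    return hits

# Hypothetical file contents, for illustration only.
sample = """x = 1
# STUDENT 1: hypothetical requirement text goes here.
# STUDENTS MUST uncomment this code after defining the model.
# y = x + 1
"""
for number, text in find_student_comments(sample):
    print(number, text)
```

To scan the real file, pass it the text of CSC523Fall2022Classify_generator.py, for example find_student_comments(open("CSC523Fall2022Classify_generator.py").read()).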
Here are Figures relating to questions in README.txt.
Figure 1: Initial distribution of numeric target attribute
SS_All (NumericTarget)
Figure 2: Distribution of Log10Target (targetdataLog10 STUDENT
1), log10(SS_All) values
Take 10^Maximum to get the max of Figure 1.
Figure 3: Distribution of SqrtTarget (targetdataSqrt),
sqrt(SS_All) values
Take Maximum^2 to get the max of Figure 1.
Figure 4: Equal-value-range discretization distribution of
SS_All in targetdataNomRange
Figure 5: Equal-bin-frequency discretization distribution of
SS_All in targetdataNomEqFreq (STUDENT 2)
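To make the difference between the two discretizations in Figures 4 and 5 concrete, here is a minimal pure-Python sketch. The function names, the number of bins, and the counts are assumptions for illustration; the handout code may build targetdataNomRange and targetdataNomEqFreq differently.

```python
def equal_width_bins(values, nbins):
    """Bin index 0..nbins-1 per value, using equal-width value ranges.

    Assumes the values are not all identical (otherwise the width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    # The maximum value would compute to index nbins, so clamp it into the last bin.
    return [min(int((v - lo) / width), nbins - 1) for v in values]

def equal_frequency_bins(values, nbins):
    """Bin index per value so that each bin holds about the same number of values.

    Tied values may be split across adjacent bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * nbins // len(values)
    return bins

counts = [1, 2, 3, 4, 5, 6, 7, 1000]    # hypothetical, strongly skewed counts
print(equal_width_bins(counts, 4))       # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(counts, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

With skewed data, equal-width ranges put almost every instance in the lowest bin, while equal-frequency bins spread the instances evenly.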
Figure 6: Y=Log10Target of Figure 2 as a function of
X=NumericTarget in Figure 1.
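The log10 and square-root transforms behind Figures 2, 3 and 6, and the way each one is inverted, can be sketched as follows. The counts are made up, and the sketch assumes every count is positive, since math.log10(0) raises ValueError; this handout does not say how the assignment code treats zero counts.

```python
import math

counts = [1, 10, 100, 1000]   # hypothetical positive counts

log10_target = [math.log10(c) for c in counts]
sqrt_target = [math.sqrt(c) for c in counts]

# Inverting each transform at its maximum recovers the original maximum:
# 10 raised to the log10 maximum, and the sqrt maximum squared.
max_from_log10 = 10 ** max(log10_target)
max_from_sqrt = max(sqrt_target) ** 2
print(max(counts), round(max_from_log10), round(max_from_sqrt))
```

Both transforms compress large values more than small ones, and log10 compresses them more strongly than the square root.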
Following Figures Added after Handout
Distributed Late Morning of 10/12 to help with README.
Figure 7: M5P model of SS_All, original counts by year, month
7 Linear Expressions at the Leaves
Correlation coefficient         0.9076
Mean absolute error             407.9356
Root mean squared error         736.9916
Relative absolute error         31.0553 %
Root relative squared error     42.2449 %
Total Number of Instances       226
Figure 8: Same Data as Figure 7, Final Years
Figure 9: M5P model of SS_All_Log10, original counts by year, month
8 Linear Expressions at the Leaves
Correlation coefficient         0.965
Mean absolute error             0.2364
Root mean squared error         0.3031
Relative absolute error         23.9819 %
Root relative squared error     26.4643 %
Total Number of Instances       226
Figure 10: Same Data as Figure 9, Final Years
Figure 11: SS_All observations as a function of discrete month 8
through 12.
Figure 12: SS_All_Log10 observations as a function of discrete
month 8 through 12.
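The error measures listed with Figures 7 and 9 can be computed from actual and predicted values roughly as in the sketch below. It follows the usual textbook definitions and uses the mean of the actual values as the baseline for the two relative errors; that baseline is an assumption here, and a tool may compute it differently (for example from training data). The function name regression_metrics and the numbers are made up for illustration.

```python
import math

def regression_metrics(actual, predicted):
    """Correlation, MAE, RMSE, and relative errors (percent) for paired values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_err = sum(abs(e) for e in errors)
    sq_err = sum(e * e for e in errors)
    # Baseline predictor: always guess the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }

actual = [1.0, 2.0, 3.0, 4.0]        # hypothetical target values
predicted = [1.5, 1.5, 3.5, 3.5]     # hypothetical model predictions
for name, value in regression_metrics(actual, predicted).items():
    print(name, round(value, 4))
```

Mean absolute error and root mean squared error are in the units of the target attribute, so their values for SS_All (counts) and SS_All_Log10 (log10 of counts) are on different scales. The correlation coefficient and the two relative errors are unitless.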
3. We will go over this handout code in class
on October 17. When you are ready to test your code, type make
clean test in the code directory. A successful
test run looks like this:
$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
/bin/rm -f CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/usr/local/bin/python3.7 CSC523Fall2022Classify_main.py CSC523Fall2022ClassifyTrace.txt CSC523Fall2022Classify_generator month_aggregate_HMS_goodyears.arff '' > CSC523Fall2022ClassifyOut.txt
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022ClassifyOut.txt CSC523Fall2022ClassifyOut.txt.ref > CSC523Fall2022ClassifyOut.txt.dif
sed -e 's/^TIME[^D]*DATA/DATA/' < CSC523Fall2022ClassifyTrace.txt > CSC523Fall2022ClassifyTrace.tmp
diff --ignore-trailing-space --strip-trailing-cr CSC523Fall2022ClassifyTrace.tmp CSC523Fall2022ClassifyTrace.txt.ref > CSC523Fall2022ClassifyTrace.txt.dif
grep DATA CSC523Fall2022ClassifyOut.txt.ref | sort -rn -t' ' -k13 | sed -e 's/^DATA/\nDATA/' | grep REGRESSOR > CSC523Fall2022ClassifyOut.sorted.txt.ref
echo "" >> CSC523Fall2022ClassifyOut.sorted.txt.ref
grep DATA CSC523Fall2022ClassifyOut.txt.ref | sort -rn -t' ' -k17 | sed -e 's/^DATA/\nDATA/' | grep CLASSIFIER >> CSC523Fall2022ClassifyOut.sorted.txt.ref
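The sed step in the run above removes the leading TIME portion of each trace line before the diff, presumably so that run-to-run timing differences do not cause a test failure. A rough Python equivalent of that substitution is sketched below. The sample trace lines are hypothetical, since the real trace format is not shown in this handout.

```python
import re

def strip_time_prefix(line):
    """Python equivalent of: sed -e 's/^TIME[^D]*DATA/DATA/'"""
    # ^TIME    : the line must start with TIME
    # [^D]*    : then any run of characters that are not an upper-case D
    # DATA     : up to the DATA marker, which is kept in the replacement
    return re.sub(r"^TIME[^D]*DATA", "DATA", line)

# Hypothetical trace lines from two runs that differ only in elapsed time.
run1 = "TIME 0.52 sec DATA example line"
run2 = "TIME 0.61 sec DATA example line"
print(strip_time_prefix(run1))
print(strip_time_prefix(run1) == strip_time_prefix(run2))
```

Lines that do not start with TIME pass through unchanged.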
4. Make sure to answer all questions in README.txt.
You can do this before your code is working.
5. After make clean test works without
errors (terminates without an error message) and README.txt
is answered, type make turnitin and hit Enter at the
prompt to get your work to me before the deadline.
If you make subsequent changes and make clean test still
passes, you can run make turnitin again to overwrite
your previous submission. Note that this is not the "turnin"
script you may have used in other courses.
There is a 10% per day penalty for late assignments in my courses,
and I cannot grant any points after I go over a solution.
$ make turnitin
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc CSC523Fall2022ClassifyTrace.txt CSC523Fall2022ClassifyOut.txt
/bin/rm -f *.tmp *.o *.dif *.out *.csv __pycache__/*
Do you really want to send CSC523Fall2022Classify to Professor Parson?
Hit Enter to continue, control-C to abort.
/bin/bash -c "cd .. ; /bin/chmod 700 . ; \
/bin/tar cvf ./CSC523Fall2022Classify_parson.tar CSC523Fall2022Classify ; \
/bin/gzip ./CSC523Fall2022Classify_parson.tar ; \
/bin/chmod 666 ./CSC523Fall2022Classify_parson.tar.gz ; \
/bin/mv ./CSC523Fall2022Classify_parson.tar.gz ~parson/incoming"
CSC523Fall2022Classify/
CSC523Fall2022Classify/__pycache__/
CSC523Fall2022Classify/makelib
CSC523Fall2022Classify/diffarff.py
CSC523Fall2022Classify/plotcsv_1_2.py
CSC523Fall2022Classify/arfflib_3_3.py
CSC523Fall2022Classify/__init__.py
CSC523Fall2022Classify/month_aggregate_HMS_goodyears.arff
CSC523Fall2022Classify/CSC523Fall2022Classify_main.py
CSC523Fall2022Classify/CSC523Fall2022ClassifyOut.txt.ref
CSC523Fall2022Classify/CSC523Fall2022ClassifyTrace.txt.ref
CSC523Fall2022Classify/CSC523Fall2022ClassifyOut.sorted.txt.ref
CSC523Fall2022Classify/CSC523Fall2022Classify_generator.py
CSC523Fall2022Classify/F2022Assn3Keepers.arff.gz
CSC523Fall2022Classify/makefile
CSC523Fall2022Classify/README.txt