CSC 523 - Scripting for Data Science, Fall 2022, Assignment 1 (official).
Assignment 1 due 11:59 PM Tuesday September 27 via make turnitin. Additions made after the first class are in red.
Added 9/12:
Once this if is satisfied:
    for line in treeLines:
        matched = AST.LMStartPattern.match(line)
        if matched:
you can use matched.group(N) to get the Nth parenthesized matching string.
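As a minimal sketch of how matched.group(N) behaves, the example below uses a hypothetical stand-in pattern, not the handout's actual AST.LMStartPattern, and assumes the supplied code has already stripped whitespace from each line (as the error message later on this page suggests).

```python
import re

# Hypothetical stand-in for AST.LMStartPattern; the real pattern is
# defined in the handout code and may differ.
LMStartPattern = re.compile(r'^LMnum:([0-9]+)$')

# Sample lines with whitespace removed (an assumption about the input).
treeLines = ["yearSince1976<=0.478:LM1(22/6.545%)", "LMnum:1", "BE_All="]

found = []
for line in treeLines:
    matched = LMStartPattern.match(line)
    if matched:
        # group(0) is the whole match; group(1) is the first
        # parenthesized sub-match, here the linear model number.
        found.append((matched.group(0), int(matched.group(1))))
```

Only the second line matches, so found holds one pair: the full matched text and the model number as an int.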
Added 8/31 after a student found this missing. Changed part in bold red:
    # A DecisionString is a series of zero or more optional '|' characters
    # denoting nested "if" conditions, followed by a mandatory IDString,
    # followed by a mandatory test string (one of <, <=, >, >=),
    # followed by a floating point constant number (optional plus or minus
    # sign, one or more mandatory digits, a mandatory '.', another one or
    # more mandatory digits), followed by a mandatory ':', followed by an
    # optional LMString, followed by zero or more characters until the end
    # of string.
    # Use string concatenation to glue DecisionString from its substrings
    # and then compile it. BE CAREFUL because '|' is a meta-character
    # used for ORing two alternative patterns and therefore must be
    # escaped.
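The comment above can be sketched roughly as follows. This is an illustration under assumptions, not the reference solution: the sub-pattern names BarsString, TestString, and FloatString are invented here, and the IDString and LMString definitions are guesses because the handout's own definitions are not shown on this page.

```python
import re

# All names below are hypothetical; IDString and LMString are assumed
# shapes, since the handout's real definitions are not shown here.
BarsString = r'(\|*)'                     # zero or more ESCAPED '|' characters
IDString = r'([A-Za-z_][A-Za-z0-9_]*)'    # assumed identifier syntax
TestString = r'(<=|>=|<|>)'               # here '|' is the OR meta-character
FloatString = r'([+-]?[0-9]+\.[0-9]+)'    # optional sign, digits, '.', digits
LMString = r'(LM[0-9]+)?'                 # assumed form, optional

# Glue the pieces together with string concatenation, then compile once.
DecisionString = (BarsString + IDString + TestString + FloatString
                  + ':' + LMString + '.*$')
DecisionPattern = re.compile(DecisionString)

m = DecisionPattern.match('yearSince1976<=0.478:LM1(22/6.545%)')
nested = DecisionPattern.match('||wndN_1st>0.25:')
```

Note the two roles of '|': escaped as \| it matches a literal bar from the tree's nesting prefix, while unescaped inside TestString it separates alternatives.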
Added 8/31 after answering a student's very good
question at the bottom.
1. To get the assignment on acad (kuvapcsitrd01.kutztown.edu) or mcgonagall (ssh kupapcsit01 from acad):
mkdir ~/DataMine    # (This is in your home directory; it may exist if you took one of my data science courses.)
cd ~/DataMine
cp ~parson/DataMine/CSC523assn1REfall2022.problem.zip CSC523assn1REfall2022.problem.zip
unzip CSC523assn1REfall2022.problem.zip
cd ./CSC523assn1REfall2022
make clean test
# This fails on handout code as follows:
$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
/usr/local/bin/python3.7 parseM5Ptree.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P
ERROR IN tree.parseTreeAndRules(): ERROR parsing M5P line: yearSince1976<=0.478:LM1(22/6.545%)
INTERNAL DATA DUMP:
# NAME=raptoryear_BE_All_NORM_TMP_m5pNode10
# TARGET=None
# ATTRIBUTES=
# TESTS=
# LINEAR_EXPRESSIONS=
make: *** [raptoryear_BE_All_NORM_TMP_m5pNode10.py] Error 1
2. Edit parseM5Ptree.py and search for upper-case STUDENT comments. The requirements with % point values are there.
See the course page on Notepad++ if you are new to our Linux systems, or you can log in and use the vim or emacs editor. Scroll down to Basic UNIX Information on this page.
Pythex is your friend for testing & debugging Python regular expressions.
Solution to the fall 2020 CSC523 re assignment is on acad at ~parson//DataMine/csc523F20TCPUDP.solution.zip, also unzipped in ~parson/DataMine. See Assignment 1 on the Fall 2020 CSC523 course page. This is just an example, not part of your assignment 1.
The purpose of this code is to parse the textual M5P model trees
produced by the Weka tool's analysis of up to 46 years of annual
raptor counts and weather statistics from Hawk Mountain Sanctuary's
North Lookout, and then generate Python model code that implements
these M5P models. We will make heavy use of Python's regular
expression library re as well as the Pythex interactive tool for
testing regular expressions in this assignment.
Here are the names of the four M5P files that your code will parse
and compile to Python:
raptoryear_BE_All_NORM_TMP_m5pNode10.txt
raptoryear_NH_All_NORM_TMP_m5pNode10.txt
raptoryear_SS_All_NORM_TMP_m5pNode10.txt
year_CloudCover_mean_TMP_M5P.txt
BE_All is the Bald Eagle count for each year, NH is Northern
Harrier, SS is Sharp-shinned Hawk, and CloudCover_mean is the
average cloud cover, from 0% to 100%, for each year in the data.
Here are the initial lines of raptoryear_BE_All_NORM_TMP_m5pNode10.txt
with the portion passed to your parser (function parseTreeAndRules(self,
treeLines)) in bold.
Options: -M 10.0 -num-decimal-places 4
=== Classifier model (full training set) ===
M5 pruned model tree:
(using smoothed linear models)
yearSince1976 <= 0.478 : LM1 (22/6.545%)
yearSince1976 > 0.478 : LM2 (24/27.445%)
LM num: 1
BE_All =
372.6098 * yearSince1976
- 46.2628 * wndUNK_last
- 9.7266
LM num: 2
BE_All =
750.7202 * yearSince1976
- 72.3768 * wndN_1st
- 43.8904 * wndUNK_last
- 166.6029
Number of Rules : 2
The decision tree of this M5P model compares attribute
yearSince1976 to constant value 0.478, with values <= that
constant running linear expression 1, else running linear
expression 2. Attribute yearSince1976 has been previously normalized
into the range [0.0, 1.0], where 0.0 is 1976, the min year in the
data, and 1.0 is 2021, the latest (max) year in the data. We
sometimes normalize data to put all attributes on a single scale
for viewing or model building. (2021-1976)*0.478+1976 is 1997.51;
the cutoff for the decision is between 1997 and 1998. Attribute
wndUNK_last is the day-of-year for the last non-zero count of
wind-direction-unknown (no wind), and wndN_1st is the day-of-year
for the first non-zero count of north wind at Hawk Mountain.
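The normalization arithmetic above can be checked with a short sketch. The function names normalize and denormalize are made up for this illustration and do not come from the handout code.

```python
def normalize(value, lo, hi):
    """Map value from the range [lo, hi] onto [0.0, 1.0] (min-max scaling)."""
    return (value - lo) / (hi - lo)

def denormalize(norm, lo, hi):
    """Map a normalized value in [0.0, 1.0] back onto [lo, hi]."""
    return norm * (hi - lo) + lo

MIN_YEAR, MAX_YEAR = 1976, 2021

# The M5P decision constant 0.478 expressed as a calendar year.
cutoff_year = denormalize(0.478, MIN_YEAR, MAX_YEAR)

# Years up to 1997 fall on the LM1 side of the test, 1998 onward on LM2.
uses_LM1_in_1997 = normalize(1997, MIN_YEAR, MAX_YEAR) <= 0.478
uses_LM1_in_1998 = normalize(1998, MIN_YEAR, MAX_YEAR) <= 0.478
```

Here cutoff_year comes out at about 1997.51, matching the hand calculation, so 1997 takes the first branch and 1998 the second.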
Here is the simpler M5P model in raptoryear_NH_All_NORM_TMP_m5pNode10.txt
with your parser's text again in bold. There is no decision tree
in this model. All data use a single linear expression. Attribute
wndENE_75th is the day-of-year for 75% of the east-northeast count
for the year, and HourlyWindSpeed_min is the year's minimum wind
speed in km/hour at Allentown Airport. You do not need to
understand these attribute meanings to write this parser. You may
need to understand some of them in later assignments.
M5 pruned model tree:
(using smoothed linear models)
LM1 (46/57.473%)
LM num: 1
NH_All =
-132.2203 * yearSince1976
- 128.0648 * wndENE_75th
+ 155.6269 * HourlyWindSpeed_min
+ 312.6288
Number of Rules : 1
There are more lines in each of these four input text files. My
supplied code reduces them down to the parts that your
function parseTreeAndRules(self, treeLines) needs
to parse.
Here is the Python code generated from raptoryear_BE_All_NORM_TMP_m5pNode10.txt
into raptoryear_BE_All_NORM_TMP_m5pNode10.py. You are not
responsible for generating code. Your function builds an annotated
syntax tree (AST), which is a data structure that stores the
components of an M5P tree. My supplied code generates Python from
your AST.
def make_raptoryear_BE_All_NORM_TMP_m5pNode10(attrNamesToColumns):
    wndN_1st_COLUMN = attrNamesToColumns["wndN_1st"]
    wndUNK_last_COLUMN = attrNamesToColumns["wndUNK_last"]
    yearSince1976_COLUMN = attrNamesToColumns["yearSince1976"]
    def raptoryear_BE_All_NORM_TMP_m5pNode10(rowOfData):
        wndN_1st= rowOfData[wndN_1st_COLUMN]
        wndUNK_last= rowOfData[wndUNK_last_COLUMN]
        yearSince1976= rowOfData[yearSince1976_COLUMN]
        if (yearSince1976 <= 0.478):
            BE_All=372.6098*yearSince1976-46.2628*wndUNK_last-9.7266
        else: #elif (yearSince1976 > 0.478):
            BE_All=750.7202*yearSince1976-72.3768*wndN_1st-43.8904*wndUNK_last-166.6029
        return BE_All
    return raptoryear_BE_All_NORM_TMP_m5pNode10
target_raptoryear_BE_All_NORM_TMP_m5pNode10 = "BE_All"
attributes_raptoryear_BE_All_NORM_TMP_m5pNode10 = ["wndN_1st","wndUNK_last","yearSince1976"]
The generated code is actually a Python closure, which we
will go over in class. Function make_raptoryear_BE_All_NORM_TMP_m5pNode10
binds the attribute column indices in the data into local
variables, then defines and returns function raptoryear_BE_All_NORM_TMP_m5pNode10.
This latter function is later called once for each row of data,
modeling BE_All as a function of the other, non-target attributes.
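To make the closure idea concrete, here is a reduced sketch with the same shape as the generated code. The names make_model and model, the column layout, and the sample rows are invented for illustration; only the LM1 coefficients are taken from the BE_All model above.

```python
def make_model(attrNamesToColumns):
    # The column indices are looked up once, when the closure is built,
    # and remain visible to the inner function after make_model returns.
    yearSince1976_COLUMN = attrNamesToColumns["yearSince1976"]
    wndUNK_last_COLUMN = attrNamesToColumns["wndUNK_last"]

    def model(rowOfData):
        yearSince1976 = rowOfData[yearSince1976_COLUMN]
        wndUNK_last = rowOfData[wndUNK_last_COLUMN]
        # Linear expression LM1 from the BE_All model.
        return 372.6098 * yearSince1976 - 46.2628 * wndUNK_last - 9.7266

    return model

# Hypothetical column layout and data rows, not from the assignment data.
columns = {"wndUNK_last": 0, "yearSince1976": 1}
predict = make_model(columns)          # bind the columns once
rows = [[0.0, 0.0], [0.5, 0.25]]
predictions = [predict(row) for row in rows]   # call once per row
```

The benefit of this design is that the name-to-column dictionary lookup happens one time rather than on every row of data.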
Here is the generated code raptoryear_NH_All_NORM_TMP_m5pNode10.py
for raptoryear_NH_All_NORM_TMP_m5pNode10.txt.
def make_raptoryear_NH_All_NORM_TMP_m5pNode10(attrNamesToColumns):
    HourlyWindSpeed_min_COLUMN = attrNamesToColumns["HourlyWindSpeed_min"]
    wndENE_75th_COLUMN = attrNamesToColumns["wndENE_75th"]
    yearSince1976_COLUMN = attrNamesToColumns["yearSince1976"]
    def raptoryear_NH_All_NORM_TMP_m5pNode10(rowOfData):
        HourlyWindSpeed_min= rowOfData[HourlyWindSpeed_min_COLUMN]
        wndENE_75th= rowOfData[wndENE_75th_COLUMN]
        yearSince1976= rowOfData[yearSince1976_COLUMN]
        NH_All=-132.2203*yearSince1976-128.0648*wndENE_75th+155.6269*HourlyWindSpeed_min+312.6288
        return NH_All
    return raptoryear_NH_All_NORM_TMP_m5pNode10
target_raptoryear_NH_All_NORM_TMP_m5pNode10 = "NH_All"
attributes_raptoryear_NH_All_NORM_TMP_m5pNode10 = ["HourlyWindSpeed_min","wndENE_75th","yearSince1976"]
3. We will go over the handout code in class. When you are ready to test your code, type make clean test in the code directory. A successful test run looks like this:
$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
/usr/local/bin/python3.7 parseM5Ptree.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P
bash ./mydiff.sh raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
diff --ignore-trailing-space --strip-trailing-cr raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref > raptoryear_BE_All_NORM_TMP_m5pNode10.py.dif
diff --ignore-trailing-space --strip-trailing-cr raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py.ref > raptoryear_NH_All_NORM_TMP_m5pNode10.py.dif
diff --ignore-trailing-space --strip-trailing-cr raptoryear_SS_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py.ref > raptoryear_SS_All_NORM_TMP_m5pNode10.py.dif
diff --ignore-trailing-space --strip-trailing-cr year_CloudCover_mean_TMP_M5P.py year_CloudCover_mean_TMP_M5P.py.ref > year_CloudCover_mean_TMP_M5P.py.dif
echo "TESTS PASS."
TESTS PASS.
Tests can fail in one of two ways. Script parseM5Ptree.py may blow
up on a bug with an error message to the terminal, e.g., the
handout code bug above:
ERROR IN tree.parseTreeAndRules(): ERROR parsing M5P line: yearSince1976<=0.478:LM1(22/6.545%)
INTERNAL DATA DUMP:
# NAME=raptoryear_BE_All_NORM_TMP_m5pNode10
# TARGET=None
# ATTRIBUTES=
# TESTS=
# LINEAR_EXPRESSIONS=
make: *** [raptoryear_BE_All_NORM_TMP_m5pNode10.py] Error 1
Or the program may run without blowing up but produce incorrect
output as flagged by these diff steps above:
diff --ignore-trailing-space --strip-trailing-cr raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref > raptoryear_BE_All_NORM_TMP_m5pNode10.py.dif
The *.dif file (* is shorthand for the model name) shows
differences between output file *.py and correct reference file
*.py.ref. You may need to use an editor to compare the differing
lines from *.py and *.py.ref if *.dif is too hard to interpret. I
will demo a diff in class.
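If a *.dif file is hard to read, a unified diff is one alternative view. The sketch below uses Python's standard difflib module on two small in-memory listings. The file names and contents are invented for illustration, and stripping trailing whitespace only approximates the --ignore-trailing-space and --strip-trailing-cr options used by the makefile's diff.

```python
import difflib

# Hypothetical contents of a generated file and its reference file.
generated = [
    "if (yearSince1976 <= 0.478):   ",      # trailing spaces are ignored
    "    BE_All=1.0",                       # wrong expression
]
reference = [
    "if (yearSince1976 <= 0.478):",
    "    BE_All=372.6098*yearSince1976-46.2628*wndUNK_last-9.7266",
]

# Strip trailing whitespace (including any '\r') before comparing.
left = [line.rstrip() for line in generated]
right = [line.rstrip() for line in reference]

diff_lines = list(difflib.unified_diff(
    left, right, fromfile="model.py", tofile="model.py.ref", lineterm=""))

for line in diff_lines:
    print(line)
```

Lines prefixed with '-' appear only in the generated file and lines prefixed with '+' appear only in the reference, so each '-'/'+' pair points at an expression to fix.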
4. After make clean test works without errors (terminates without an error message), type make turnitin and hit Enter at the prompt to get your work to me before the deadline.
If you make subsequent changes and make clean test still passes, you can run make turnitin again and overwrite your previous submission. Note that this is not the "turnin" script you may have used in other courses.
There is a 10% per day penalty for late assignments in my courses
and I cannot grant any points after I go over a solution.
$ make turnitin
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
Do you really want to send CSC523assn1REfall2022 to Professor Parson?
Hit Enter to continue, control-C to abort.
/bin/bash -c "cd .. ; /bin/chmod 700 . ; \
/bin/tar cvf ./CSC523assn1REfall2022_parson.tar CSC523assn1REfall2022 ; \
/bin/gzip ./CSC523assn1REfall2022_parson.tar ; \
/bin/chmod 666 ./CSC523assn1REfall2022_parson.tar.gz ; \
/bin/mv ./CSC523assn1REfall2022_parson.tar.gz ~parson/incoming"
CSC523assn1REfall2022/
CSC523assn1REfall2022/arfflib_3_2.py
CSC523assn1REfall2022/mydiff.sh
CSC523assn1REfall2022/raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_BE_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/raptoryear_NH_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_NH_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/raptoryear_SS_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_SS_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/verify_generatedCode.arff
CSC523assn1REfall2022/verify_generatedCode.py
CSC523assn1REfall2022/year_aggregate_HMS_1976_2021.csv
CSC523assn1REfall2022/year_CloudCover_mean_TMP_M5P.py.ref
CSC523assn1REfall2022/year_CloudCover_mean_TMP_M5P.txt
CSC523assn1REfall2022/__pycache__/
CSC523assn1REfall2022/makelib
CSC523assn1REfall2022/makefile
CSC523assn1REfall2022/plotcsv.py
CSC523assn1REfall2022/parseM5Ptree.py
------------------------------------------------------------------
Part of a reply to a student's question on 8/31:
I have been using this mostly for visualization so far, to compare
a model's predictions to the actual target attribute values, for
example Figures 3 through 5 here:
https://acad.kutztown.edu/~parson/HawkMtnDaleParson2022/#WEATHER
Weka doesn't give you a way to visualize models. You could use
AddExpression if you have the patience to type in a complicated
nested ifelse() expression equivalent to the M5P. That is
time-intensive and error-prone. I coded Weka output models in Python
by hand to generate the csv files for Figures 3-5 and others in the
above report. Our assignment 1 automates the coding for visualizations
like the following. First run make verify:
$ make verify
/usr/local/bin/python3.7 verify_generatedCode.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P
Processing module raptoryear_BE_All_NORM_TMP_m5pNode10
TARGET= BE_All
ATTRIBUTES= ['wndN_1st', 'wndUNK_last', 'yearSince1976']
Processing module raptoryear_NH_All_NORM_TMP_m5pNode10
TARGET= NH_All
ATTRIBUTES= ['HourlyWindSpeed_min', 'wndENE_75th', 'yearSince1976']
Processing module raptoryear_SS_All_NORM_TMP_m5pNode10
TARGET= SS_All
ATTRIBUTES= ['HourlyWetBulbTemperature_24_min', 'HourlyWindSpeed_mean', 'SkyCode_median', 'WindSpd_median', 'noaawdUNK', 'wndE_75th', 'wndNW_75th']
Processing module year_CloudCover_mean_TMP_M5P
TARGET= CloudCover_mean
ATTRIBUTES= ['yearSince1976']
That creates file verify_generatedCode.arff from which you can create attribute line graphs like this:
$ python plotcsv.py verify_generatedCode.arff year SS_All raptoryear_SS_All_NORM_TMP_m5pNode10
That creates the following output graph that compares SS_All count
(sharp-shinned hawks) to the M5P model's prediction in
raptoryear_SS_All_NORM_TMP_m5pNode10.
Figure: SS_All in red (sharp-shinned hawk counts by year) compared to the M5P model in blue.
The above plotcsv.py command line generates an interactive graph
on your local machine. To generate a PNG file on acad or
mcgonagall that you can view remotely, add -file after the
X-axis attribute:
$ python plotcsv.py verify_generatedCode.arff year -file SS_All raptoryear_SS_All_NORM_TMP_m5pNode10
DEBUG X TYPE <class 'float'> 1976.0
X is year Type is <class 'float'>
MEAN 1998.5 MEDIAN 1998.5 PSTDEV 13.275918047351754 MIN 1976.0 MAX 2021.0
Y is raptoryear_SS_All_NORM_TMP_m5pNode10 Type is <class 'float'>
BROWSE https://acad.kutztown.edu/~parson/plotcsv.png
Those remote-generated PNG files are less visually elegant than
the interactive graph on your local machine. For this to work you
must have a login directory (~) and a ~/public_html directory with
read and execute permissions enabled. Here are mine as an example.
$ ls -ld ~
drwxr-xr-x. 26 parson apache 4096 Aug 26 13:26
/home/kutztown.edu/parson
$ ls -ld ~/public_html
drwxr-xr-x. 8 parson csit_faculty 20480 Aug 23 16:47
/home/kutztown.edu/parson/public_html
If you do not see the ~/public_html directory do this:
$ mkdir ~/public_html
If you do not see "r-x" in the last 3 permission characters do
this:
$ chmod o+r+x ~
$ chmod o+r+x ~/public_html
Check permissions again with the "ls" command per above
instructions.
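The same permission check can also be done from Python. This is only a sketch: the function names are invented for illustration, and the demonstration runs on a temporary directory rather than on ~ or ~/public_html.

```python
import os
import stat
import tempfile

def others_can_read_and_enter(path):
    """Return True if 'other' users have both r and x permission on path."""
    mode = os.stat(path).st_mode
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed

def grant_others_read_and_enter(path):
    """Like chmod o+r+x: add the two bits, leave all other bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IROTH | stat.S_IXOTH)

# Demonstrate on a scratch directory that starts out private (rwx------).
demo_dir = tempfile.mkdtemp()
os.chmod(demo_dir, 0o700)
before = others_can_read_and_enter(demo_dir)
grant_others_read_and_enter(demo_dir)
after = others_can_read_and_enter(demo_dir)
final_mode = stat.S_IMODE(os.stat(demo_dir).st_mode)
os.rmdir(demo_dir)
```

After the grant, the last three permission characters of the scratch directory read "r-x", which is what the ls -ld check above looks for.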