CSC 523 - Scripting for Data Science, Fall 2022, Assignment 1 (official).

Assignment 1 is due 11:59 PM Tuesday, September 27, via make turnitin. Additions made after the first class are in red.

Added 9/12:
Once this "if" is satisfied:
        for line in treeLines:
            matched = AST.LMStartPattern.match(line)
            if matched:
you can use matched.group(N) to get the Nth parenthesized matching string.
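
As a minimal sketch of how group(N) behaves, the following uses a hypothetical pattern named demoPattern. It is not the handout's AST.LMStartPattern, whose definition lives in parseM5Ptree.py and may differ. group(0) is the entire matched text, and group(1), group(2), ... are the parenthesized sub-matches counted by their opening parenthesis from left to right.

```python
import re

# Hypothetical pattern for illustration only; the real AST.LMStartPattern
# is defined in the handout code and may look different.
demoPattern = re.compile(r'^(LM)\s*num\s*:\s*([0-9]+)')

def lm_number(line):
    """Return the linear-model number in a line like 'LM num: 1', else None."""
    matched = demoPattern.match(line)
    if matched:
        # group(1) is 'LM', group(2) is the digit string.
        return int(matched.group(2))
    return None

demoMatch = demoPattern.match('LM num: 2')
wholeMatch = demoMatch.group(0)    # the entire matched text
firstGroup = demoMatch.group(1)    # first parenthesized group
secondGroup = demoMatch.group(2)   # second parenthesized group
```

Note that match() returns None when the pattern does not match, which is why the "if matched:" test must come before any call to group().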


Added 8/31 after a student found this missing. Changed part in bold red:

    # A DecisionString is a series of zero or more optional '|' characters
    # denoting nested "if" conditions, followed by a mandatory IDString,
    # followed by a mandatory test string (one of <, <=, >, >=),
    # followed by a floating point constant number (optional plus or minus sign, one or more mandatory
    # digits, a mandatory '.', another one or more mandatory digits), followed
    # by a mandatory ':', followed by an optional LMString, followed
    # by zero or more characters until the end of string.
    # Use string concatenation to glue DecisionString from its substrings
    # and then compile it. BE CAREFUL because '|' is a meta-character
    # used for ORing two alternative patterns and therefore must be
    # escaped.
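
The following is only a sketch of the concatenate-then-compile approach described above, not the reference solution. The sub-pattern names BarString, TestString and FloatString, and the definitions assumed here for IDString and LMString, are illustrative guesses; the handout code may define them differently. It also assumes that whitespace has already been removed from each line, as the error message for the handout code suggests.

```python
import re

# Assumed definitions, for illustration only.
IDString = r'([A-Za-z_][A-Za-z0-9_]*)'   # attribute name (assumption)
LMString = r'(LM[0-9]+)'                 # linear model label (assumption)
BarString = r'(\|*)'                     # escaped '|': zero or more literal bars
TestString = r'(<=|>=|<|>)'              # unescaped '|' here means alternation
FloatString = r'([+-]?[0-9]+\.[0-9]+)'   # optional sign, digits, '.', digits

# Glue the pieces together with string concatenation, then compile.
DecisionString = ('^' + BarString + IDString + TestString + FloatString
                  + ':' + LMString + '?' + '.*$')
DecisionPattern = re.compile(DecisionString)

demoMatch = DecisionPattern.match('yearSince1976<=0.478:LM1(22/6.545%)')
nestedMatch = DecisionPattern.match('||wndN_1st>-0.5:')
noMatch = DecisionPattern.match('LM1(46/57.473%)')
```

In TestString the two-character tests are listed before the one-character tests so that '<=' is not matched as '<' followed by a stray '='. When the optional LMString is absent, its group is None rather than an empty string.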

Added 8/31 after answering a student's very good question at the bottom.

1. To get the assignment on acad (kuvapcsitrd01.kutztown.edu) or mcgonagall (ssh kupapcsit01 from acad):

    mkdir ~/DataMine   #  (This is in your home directory; it may already exist if you took one of my data science courses.)
    cd ~/DataMine
    cp ~parson/DataMine/CSC523assn1REfall2022.problem.zip CSC523assn1REfall2022.problem.zip
    unzip CSC523assn1REfall2022.problem.zip
    cd ./CSC523assn1REfall2022
    make clean test        # This fails on handout code as follows:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
/usr/local/bin/python3.7 parseM5Ptree.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P

ERROR IN tree.parseTreeAndRules(): ERROR parsing M5P line: yearSince1976<=0.478:LM1(22/6.545%)

INTERNAL DATA DUMP:
# NAME=raptoryear_BE_All_NORM_TMP_m5pNode10
# TARGET=None
# ATTRIBUTES=
# TESTS=
# LINEAR_EXPRESSIONS=

make: *** [raptoryear_BE_All_NORM_TMP_m5pNode10.py] Error 1

2. Edit parseM5Ptree.py and search for upper-case STUDENT comments. The requirements with % point values are there.

    See the course page on Notepad++ if you are new to our Linux systems, or you can log in and use the vim or emacs editor.
    Scroll down to Basic UNIX Information on this page.
    Pythex is your friend for testing & debugging Python regular expressions.
    The solution to the fall 2020 CSC523 re assignment is on acad at ~parson/DataMine/csc523F20TCPUDP.solution.zip, also unzipped in ~parson/DataMine.
    See Assignment 1 on the Fall 2020 CSC523 course page. This is just an example, not part of your assignment 1.

The purpose of this code is to parse the textual M5P model trees produced by the Weka tool's analysis of up to 46 years of annual raptor counts and weather statistics from Hawk Mountain Sanctuary's North Lookout, and then generate Python model code that implements these M5P models. In this assignment we will make heavy use of Python's regular expression library re, as well as the Pythex interactive tool for testing regular expressions.

Here are the names of the four M5P files that your code will parse and compile to Python:
    raptoryear_BE_All_NORM_TMP_m5pNode10.txt
    raptoryear_NH_All_NORM_TMP_m5pNode10.txt
    raptoryear_SS_All_NORM_TMP_m5pNode10.txt
    year_CloudCover_mean_TMP_M5P.txt

BE_All is the Bald Eagle count for each year, NH is Northern Harrier, SS is Sharp-shinned Hawk, and CloudCover_mean is the average cloud cover (0% to 100%) for each year in the data.

Here are the initial lines of raptoryear_BE_All_NORM_TMP_m5pNode10.txt with the portion passed to your parser (function  parseTreeAndRules(self, treeLines)) in bold.

Options: -M 10.0 -num-decimal-places 4

=== Classifier model (full training set) ===

M5 pruned model tree:
(using smoothed linear models)

yearSince1976 <= 0.478 : LM1 (22/6.545%)
yearSince1976 >  0.478 : LM2 (24/27.445%)

LM num: 1
BE_All =
    372.6098 * yearSince1976
    - 46.2628 * wndUNK_last
    - 9.7266

LM num: 2
BE_All =
    750.7202 * yearSince1976
    - 72.3768 * wndN_1st
    - 43.8904 * wndUNK_last
    - 166.6029

Number of Rules : 2

The decision tree of this M5P model compares attribute yearSince1976 to constant value 0.478, with values <= that constant running linear expression 1, else running linear expression 2. Attribute yearSince1976 has been previously normalized into the range [0.0, 1.0], where 0.0 is 1976, the earliest (min) year in the data, and 1.0 is 2021, the latest (max) year in the data. We sometimes normalize data to put all attributes on a single scale for viewing or model building. (2021-1976)*0.478+1976 is 1997.51; the cutoff for the decision is between 1997 and 1998. Attribute wndUNK_last is the day-of-year for the last non-zero count of wind-direction-unknown (no wind), and wndN_1st is the day-of-year for the first non-zero count of north wind at Hawk Mountain.
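
The normalization arithmetic above can be checked with a short sketch. The function names normalize and denormalize are illustrative and are not part of the handout code.

```python
def normalize(value, minValue, maxValue):
    """Map value from [minValue, maxValue] into [0.0, 1.0]."""
    return (value - minValue) / (maxValue - minValue)

def denormalize(normValue, minValue, maxValue):
    """Map a normalized value in [0.0, 1.0] back to the original scale."""
    return (maxValue - minValue) * normValue + minValue

# The decision constant 0.478 corresponds to a point between 1997 and 1998.
cutoffYear = denormalize(0.478, 1976, 2021)   # about 1997.51

# So 1997 falls on the "<= 0.478" branch and 1998 on the "> 0.478" branch.
uses_LM1_in_1997 = normalize(1997, 1976, 2021) <= 0.478
uses_LM2_in_1998 = normalize(1998, 1976, 2021) > 0.478
```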

Here is the simpler M5P model in raptoryear_NH_All_NORM_TMP_m5pNode10.txt with your parser's text again in bold. There is no decision tree in this model. All data use a single linear expression. Attribute wndENE_75th is the day-of-year for 75% of the east-northeast count for the year, and HourlyWindSpeed_min is the year's minimum wind speed in km/hour at Allentown Airport. You do not need to understand these attribute meanings to write this parser. You may need to understand some of them in later assignments.

M5 pruned model tree:
(using smoothed linear models)
LM1 (46/57.473%)

LM num: 1
NH_All =
    -132.2203 * yearSince1976
    - 128.0648 * wndENE_75th
    + 155.6269 * HourlyWindSpeed_min
    + 312.6288

Number of Rules : 1

There are more lines in each of these four input text files. My supplied code reduces them down to the parts that your function  parseTreeAndRules(self, treeLines) needs to parse.

Here is the Python code generated from raptoryear_BE_All_NORM_TMP_m5pNode10.txt into raptoryear_BE_All_NORM_TMP_m5pNode10.py. You are not responsible for generating code. Your function builds an annotated syntax tree (AST), which is a data structure that stores the components of an M5P tree. My supplied code generates Python from your AST.

def make_raptoryear_BE_All_NORM_TMP_m5pNode10(attrNamesToColumns):
    wndN_1st_COLUMN = attrNamesToColumns["wndN_1st"]
    wndUNK_last_COLUMN = attrNamesToColumns["wndUNK_last"]
    yearSince1976_COLUMN = attrNamesToColumns["yearSince1976"]
    def raptoryear_BE_All_NORM_TMP_m5pNode10(rowOfData):
        wndN_1st= rowOfData[wndN_1st_COLUMN]
        wndUNK_last= rowOfData[wndUNK_last_COLUMN]
        yearSince1976= rowOfData[yearSince1976_COLUMN]
        if (yearSince1976 <= 0.478):
            BE_All=372.6098*yearSince1976-46.2628*wndUNK_last-9.7266
        else:  #elif (yearSince1976 > 0.478):
            BE_All=750.7202*yearSince1976-72.3768*wndN_1st-43.8904*wndUNK_last-166.6029
        return BE_All
    return raptoryear_BE_All_NORM_TMP_m5pNode10
target_raptoryear_BE_All_NORM_TMP_m5pNode10 = "BE_All"
attributes_raptoryear_BE_All_NORM_TMP_m5pNode10 = ["wndN_1st","wndUNK_last","yearSince1976"]

The generated code is actually a Python closure, which we will go over in class. Function make_raptoryear_BE_All_NORM_TMP_m5pNode10 binds the attribute column indices in the data into local variables, then defines and returns function raptoryear_BE_All_NORM_TMP_m5pNode10. This latter function is later called once for each row of data, modeling BE_All as a function of the other, non-target attributes.
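
Here is a smaller closure in the same shape as the generated code, as a sketch only. The names make_demoModel, demoModel, and attribute "x", along with the constants, are made up for illustration and do not come from any of the four M5P models.

```python
def make_demoModel(attrNamesToColumns):
    # Look up the column index once, when the closure is built.
    x_COLUMN = attrNamesToColumns["x"]
    def demoModel(rowOfData):
        # The inner function remembers x_COLUMN from the enclosing call.
        x = rowOfData[x_COLUMN]
        if (x <= 0.5):
            target = 2.0 * x + 1.0
        else:
            target = 3.0 * x - 1.0
        return target
    return demoModel

# Build the model function once, then call it once per row of data.
model = make_demoModel({"other": 0, "x": 1})
rows = [[9.0, 0.25], [9.0, 0.75]]
predictions = [model(row) for row in rows]   # [1.5, 1.25]
```

Binding the column index outside the inner function means the dictionary lookup happens once rather than once per row.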

Here is the generated code raptoryear_NH_All_NORM_TMP_m5pNode10.py for raptoryear_NH_All_NORM_TMP_m5pNode10.txt.

def make_raptoryear_NH_All_NORM_TMP_m5pNode10(attrNamesToColumns):
    HourlyWindSpeed_min_COLUMN = attrNamesToColumns["HourlyWindSpeed_min"]
    wndENE_75th_COLUMN = attrNamesToColumns["wndENE_75th"]
    yearSince1976_COLUMN = attrNamesToColumns["yearSince1976"]
    def raptoryear_NH_All_NORM_TMP_m5pNode10(rowOfData):
        HourlyWindSpeed_min= rowOfData[HourlyWindSpeed_min_COLUMN]
        wndENE_75th= rowOfData[wndENE_75th_COLUMN]
        yearSince1976= rowOfData[yearSince1976_COLUMN]
        NH_All=-132.2203*yearSince1976-128.0648*wndENE_75th+155.6269*HourlyWindSpeed_min+312.6288
        return NH_All
    return raptoryear_NH_All_NORM_TMP_m5pNode10
target_raptoryear_NH_All_NORM_TMP_m5pNode10 = "NH_All"
attributes_raptoryear_NH_All_NORM_TMP_m5pNode10 = ["HourlyWindSpeed_min","wndENE_75th","yearSince1976"]

3. We will go over the handout code in class. When you are ready to test your code, type make clean test in the code directory. A successful test run looks like this:

$ make clean test
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
/usr/local/bin/python3.7 parseM5Ptree.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P
bash ./mydiff.sh raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py
diff  --ignore-trailing-space --strip-trailing-cr raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref > raptoryear_BE_All_NORM_TMP_m5pNode10.py.dif
diff  --ignore-trailing-space --strip-trailing-cr raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py.ref > raptoryear_NH_All_NORM_TMP_m5pNode10.py.dif
diff  --ignore-trailing-space --strip-trailing-cr raptoryear_SS_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py.ref > raptoryear_SS_All_NORM_TMP_m5pNode10.py.dif
diff  --ignore-trailing-space --strip-trailing-cr year_CloudCover_mean_TMP_M5P.py year_CloudCover_mean_TMP_M5P.py.ref > year_CloudCover_mean_TMP_M5P.py.dif
echo "TESTS PASS."
TESTS PASS.

Tests can fail in one of two ways. Script parseM5Ptree.py may blow up on a bug with an error message to the terminal, e.g., the handout code bug above:

ERROR IN tree.parseTreeAndRules(): ERROR parsing M5P line: yearSince1976<=0.478:LM1(22/6.545%)

INTERNAL DATA DUMP:
# NAME=raptoryear_BE_All_NORM_TMP_m5pNode10
# TARGET=None
# ATTRIBUTES=
# TESTS=
# LINEAR_EXPRESSIONS=

make: *** [raptoryear_BE_All_NORM_TMP_m5pNode10.py] Error 1

Or the program may run without blowing up but produce incorrect output as flagged by these diff steps above:

diff  --ignore-trailing-space --strip-trailing-cr raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref > raptoryear_BE_All_NORM_TMP_m5pNode10.py.dif

The *.dif file (* is shorthand for the model name) shows differences between output file *.py and correct reference file *.py.ref. You may need to use an editor to compare the difference lines from *.py and *.py.ref if *.dif is too hard to interpret. I will demo a diff in class.
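
If the *.dif output is hard to read, one alternative is to compare the two files from Python. This sketch uses the standard library's difflib and produces unified-format output, which is not the format that the diff command in the makefile writes. The function name diff_lines is illustrative, and stripping trailing whitespace is only an approximation of the --ignore-trailing-space and --strip-trailing-cr options.

```python
import difflib

def diff_lines(generatedText, referenceText):
    """Return unified-diff lines between two texts, ignoring trailing spaces."""
    generated = [line.rstrip() for line in generatedText.splitlines()]
    reference = [line.rstrip() for line in referenceText.splitlines()]
    return list(difflib.unified_diff(generated, reference,
                                     fromfile='generated', tofile='reference',
                                     lineterm=''))

# Lines starting with '-' appear only in the generated text,
# lines starting with '+' appear only in the reference text.
sameResult = diff_lines('x = 1   \ny = 2\n', 'x = 1\ny = 2\n')
changedResult = diff_lines('x = 1\ny = 3\n', 'x = 1\ny = 2\n')
```

To compare real files, read each one with open(name).read() and pass the two strings to diff_lines. An empty result means the files agree apart from trailing whitespace.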

4. After make clean test works without errors (terminates without an error message), type make turnitin and hit Enter at the prompt to get your work to me before the deadline.

If you make subsequent changes and make clean test still passes, you can run make turnitin again to overwrite your previous submission. Note that this is not the "turnin" script you may have used in other courses.

There is a 10% per day penalty for late assignments in my courses and I cannot grant any points after I go over a solution.

$ make turnitin
/bin/rm -f *.o *.class .jar core *.exe *.obj *.pyc __pycache__/*.pyc
/bin/rm -f junk* *.pyc *.err
/bin/rm -f *.tmp *.o *.dif *.out __pycache__/*
/bin/rm -f raptoryear_BE_All_NORM_TMP_m5pNode10.py raptoryear_NH_All_NORM_TMP_m5pNode10.py raptoryear_SS_All_NORM_TMP_m5pNode10.py year_CloudCover_mean_TMP_M5P.py

Do you really want to send CSC523assn1REfall2022 to Professor Parson?
Hit Enter to continue, control-C to abort.

/bin/bash -c "cd .. ; /bin/chmod 700 .                  ; \
        /bin/tar cvf ./CSC523assn1REfall2022_parson.tar CSC523assn1REfall2022      ; \
        /bin/gzip ./CSC523assn1REfall2022_parson.tar                    ; \
        /bin/chmod 666 ./CSC523assn1REfall2022_parson.tar.gz            ; \
        /bin/mv ./CSC523assn1REfall2022_parson.tar.gz ~parson/incoming"
CSC523assn1REfall2022/
CSC523assn1REfall2022/arfflib_3_2.py
CSC523assn1REfall2022/mydiff.sh
CSC523assn1REfall2022/raptoryear_BE_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_BE_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/raptoryear_NH_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_NH_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/raptoryear_SS_All_NORM_TMP_m5pNode10.py.ref
CSC523assn1REfall2022/raptoryear_SS_All_NORM_TMP_m5pNode10.txt
CSC523assn1REfall2022/verify_generatedCode.arff
CSC523assn1REfall2022/verify_generatedCode.py
CSC523assn1REfall2022/year_aggregate_HMS_1976_2021.csv
CSC523assn1REfall2022/year_CloudCover_mean_TMP_M5P.py.ref
CSC523assn1REfall2022/year_CloudCover_mean_TMP_M5P.txt
CSC523assn1REfall2022/__pycache__/
CSC523assn1REfall2022/makelib
CSC523assn1REfall2022/makefile
CSC523assn1REfall2022/plotcsv.py
CSC523assn1REfall2022/parseM5Ptree.py
------------------------------------------------------------------
Part of a reply to a student's question on 8/31:

I have been using this mostly for visualization so far, to compare a model's predictions to the actual target attribute values, for example Figures 3 through 5 here:

https://acad.kutztown.edu/~parson/HawkMtnDaleParson2022/#WEATHER

Weka doesn't give you a way to visualize models. You could use AddExpression if you have the patience to type in a complicated nested ifelse() expression equivalent to the M5P. That is time-intensive and error-prone. I coded Weka output models in Python by hand to generate the csv files for Figures 3-5 and others in the above report. Our assignment 1 automates the coding for visualizations like the following. First run make verify:

$ make verify
/usr/local/bin/python3.7 verify_generatedCode.py raptoryear_BE_All_NORM_TMP_m5pNode10 raptoryear_NH_All_NORM_TMP_m5pNode10 raptoryear_SS_All_NORM_TMP_m5pNode10 year_CloudCover_mean_TMP_M5P
Processing module raptoryear_BE_All_NORM_TMP_m5pNode10
TARGET= BE_All
ATTRIBUTES= ['wndN_1st', 'wndUNK_last', 'yearSince1976']
Processing module raptoryear_NH_All_NORM_TMP_m5pNode10
TARGET= NH_All
ATTRIBUTES= ['HourlyWindSpeed_min', 'wndENE_75th', 'yearSince1976']
Processing module raptoryear_SS_All_NORM_TMP_m5pNode10
TARGET= SS_All
ATTRIBUTES= ['HourlyWetBulbTemperature_24_min', 'HourlyWindSpeed_mean', 'SkyCode_median', 'WindSpd_median', 'noaawdUNK', 'wndE_75th', 'wndNW_75th']
Processing module year_CloudCover_mean_TMP_M5P
TARGET= CloudCover_mean
ATTRIBUTES= ['yearSince1976']

That creates file verify_generatedCode.arff from which you can create attribute line graphs like this:

$ python plotcsv.py verify_generatedCode.arff year SS_All  raptoryear_SS_All_NORM_TMP_m5pNode10

That creates the following output graph that compares SS_All count (sharp-shinned hawks) to the M5P model's prediction in raptoryear_SS_All_NORM_TMP_m5pNode10.

[Graph: SS_All in red - sharp-shinned hawk counts by year - compared to M5P model in blue.]

The above plotcsv.py command line generates an interactive graph on your local machine. To generate a PNG file on acad or mcgonagall that you can view remotely, add -file after the X-axis attribute:

$ python plotcsv.py verify_generatedCode.arff year -file SS_All  raptoryear_SS_All_NORM_TMP_m5pNode10
DEBUG X TYPE <class 'float'> 1976.0
X is year Type is <class 'float'>
     MEAN 1998.5     MEDIAN 1998.5     PSTDEV 13.275918047351754     MIN 1976.0     MAX 2021.0
Y is raptoryear_SS_All_NORM_TMP_m5pNode10 Type is <class 'float'>
BROWSE https://acad.kutztown.edu/~parson/plotcsv.png

Those remote-generated PNG files are less visually elegant than the interactive graph on your local machine. For this to work you must have a login directory (~) and a ~/public_html directory with read and execute permissions enabled. Here are mine as an example.

$ ls -ld ~
drwxr-xr-x. 26 parson apache 4096 Aug 26 13:26 /home/kutztown.edu/parson
$ ls -ld ~/public_html
drwxr-xr-x. 8 parson csit_faculty 20480 Aug 23 16:47 /home/kutztown.edu/parson/public_html

If you do not see the ~/public_html directory do this:

$ mkdir ~/public_html

If you do not see "r-x" in the last 3 permission characters (the permissions for other users) do this:

$ chmod o+r+x ~
$ chmod o+r+x ~/public_html

Check permissions again with the "ls" command per above instructions.