CPSC 523 - Scripting for Data Science, Fall 2024,
Tuesday 6:00-8:50 PM, Old Main 158.
Assignment 5 is due via make turnitin on acad or K120023GEMS by 11:59 PM on
Sunday, December 15.
The assignment focuses on Python regular expressions and namedtuples.
The late penalty is the standard 10% per day. Assignments must be in on
time to meet the semester grading deadline.
We will discuss it in the 11/26 class. There is no README in this final assignment.
************************************************************
NOTES FROM 12/3 CLASS POSTED 12/4:
1. To get the parsing-input files into directory rawAnalysis/,
just run make test from the command line.
You will see errors regarding the unfinished
parsing code, but it will populate rawAnalysis.
2. Make sure to quote your r'REGULAR EXPRESSIONS' in STUDENT 3
with r' at the start.
The r prefix makes a raw string: Python leaves the backslashes alone,
so regex escapes such as \d and \s reach re.compile(...) intact.
See my examples above STUDENT 3.
3. There is no cmatrix field in __RegressResults__ for STUDENT 2.
Look in reffiles/t_inFreqType_tfreq_M5P.txt for fields in ModelResults:.
**************************************************************
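Note 2 above can be checked with a small experiment. This is only a sketch, not part of the handout code: the pattern and the sample line (including the value 0.9876) are invented for illustration.

```python
import re

# Without the r prefix, Python turns "\b" into a single backspace character
# before re.compile() ever sees the pattern.
plain = "\bcoefficient\b"
# With the r prefix, the backslashes survive, so re sees \b (word boundary).
raw = r"\bcoefficient\b"

print(len(plain), len(raw))

# Invented sample line in the style of a Weka regression summary.
line = "Correlation coefficient                  0.9876"

plain_match = re.compile(plain).search(line)  # looks for literal backspaces
raw_match = re.compile(raw).search(line)      # looks for the whole word

print(plain_match)
print(raw_match)
```

The two strings differ in length because each "\b" collapses to one character in the plain string, and only the raw pattern finds the word in the sample line.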
Perform the following steps on K120023GEMS.kutztown.edu:
cd                                    # places you into your login directory
mkdir DataMine                        # all of your csc523 projects go into this directory
cd ./DataMine                         # makes DataMine your current working directory, it may already exist
cp ~parson/DataMine/CSC523F24Regexp5.problem.zip CSC523F24Regexp5.problem.zip
unzip CSC523F24Regexp5.problem.zip    # unzips your working copy of the project directory
cd ./CSC523F24Regexp5                 # your project working directory
Perform all test execution on K120023GEMS.kutztown.edu to avoid
any platform-dependent output differences.
Here are the files of interest in this project directory. There
are a few you can ignore.
CSC523F24Regexp5.py    # your Weka-output parsing work goes here
    ^^^ Handout code parses Weka classification output.
    ^^^ You will integrate regression parsing into helper function classify().
makefile               # the Linux make utility uses this script to direct testing
makelib                # my library for the makefile
diffcsvWgz.py, diffParsedFiles.sh, and arfflib_3_6.py are used in testing.
Files in directory rawAnalysis/ are Weka output text files
that our code parses.
Files in directory parsedAnalysis/ show the results of our
parsing.
Files in directory reffiles/ show the results of correct
parsing.
Files in directory tempDataFiles/ are temporary ARFF files
supplied as input to Weka.
ARFF means Attribute-Relation File Format,
i.e., CSV data rows preceded by a typed header with one line per attribute.
Any make test, make clean, or make turnitin
will remove all generated files.
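To make the ARFF description concrete, here is a minimal sketch that reads the typed header of a tiny ARFF text. The relation name, attribute names, and data values are invented, and read_arff_header is a hypothetical helper, not the API of arfflib_3_6.py. It also ignores quoted attribute names that contain spaces.

```python
import io

# Hypothetical miniature ARFF file; names and values are invented.
SAMPLE_ARFF = """\
@relation demo
@attribute freq numeric
@attribute wave {sine,square}
@data
440.0,sine
880.0,square
"""

def read_arff_header(lines):
    """Return [(name, type), ...] from the @attribute lines before @data."""
    attrs = []
    for line in lines:
        s = line.strip()
        if s.lower().startswith("@data"):
            break  # header ends here; CSV rows follow
        if s.lower().startswith("@attribute"):
            # Split into keyword, attribute name, and the rest (the type).
            _, name, atype = s.split(None, 2)
            attrs.append((name, atype))
    return attrs

attributes = read_arff_header(io.StringIO(SAMPLE_ARFF))
print(attributes)
```

Everything after the @data line is ordinary comma-separated rows, one value per declared attribute.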
Here is the summary of STUDENT coding steps.
# STUDENT 1 (4%) Complete the above comment block.
# STUDENT 2 (16%) Construct namedtuple __RegressResults__ using key
# STUDENT 3 (16%) Complete the following 2 regexps per example data lines.
# STUDENT 4 (16%) Integrate ModelKey M5P, M5PNode2500,
# STUDENT 5 (16%) Also exit this while on 'Correlation coefficient'.
# STUDENT 6 (16%) Use __PearsonrCC__.search below to set PearsonrCC:
# STUDENT 7 (16%) Set result by calling __RegressResults__(...)
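The general pattern behind steps like STUDENT 2, 6, and 7 (define a namedtuple, search lines with a compiled regexp, build the result) can be sketched as follows. Everything here is a stand-in under stated assumptions: DemoResults, its two fields, demo_cc, and the sample lines are invented for illustration and are not the handout's __RegressResults__ or __PearsonrCC__. The real field list comes from ModelResults: in the reffiles.

```python
import re
from collections import namedtuple

# Hypothetical namedtuple; the real one has the fields shown in the reffiles.
DemoResults = namedtuple("DemoResults", ["modelkey", "correlation"])

# Raw-string pattern: the literal label, whitespace, then a decimal number
# (optionally negative) captured in group 1.
demo_cc = re.compile(r"^Correlation coefficient\s+(-?\d+(?:\.\d+)?)")

# Invented lines in the style of a Weka regression summary.
sample_lines = [
    "=== Summary ===",
    "",
    "Correlation coefficient                  0.9876",
    "Mean absolute error                      12.5",
]

correlation = None
for line in sample_lines:
    m = demo_cc.search(line)
    if m:
        correlation = float(m.group(1))  # convert the captured text to a number
        break                            # stop at the first matching line

result = DemoResults(modelkey="M5P", correlation=correlation)
print(result)
print(result.correlation, result._fields)
```

Fields of a namedtuple can be read by name (result.correlation) or by position (result[1]), which keeps the parsed record readable without writing a full class.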
Make sure that all make test tests pass before you
run make turnitin.
Here is what make test reports from the handout code.
$ make test
DEBUG type(parsedResults) <class 'NoneType'>
TESTS ran in 28.49 secs, 7.76608 processor srcs.
TESTS ran in 28.49 secs, 7.76608 processor srcs.
PYTHONPATH=. /usr/bin/python3.11 diffcsvWgz.py tempDataFiles/CSC558Assn2Dupl.arff /home/kutztown.edu/parson/DataMine/CSC558Assn2Dupl.arff.gz
bash diffParsedFiles.sh
ERROR detected in t_inFreqType_tfreq_DecisionTableR.txt
COMPARE ./parsedAnalysis/t_inFreqType_tfreq_DecisionTableR.txt reffiles/t_inFreqType_tfreq_DecisionTableR.txt
-rw-r--r--. 1 parson domain users 73584 Nov 26 10:10 t_inFreqType_tfreq_DecisionTableR.txt.txt.dif
ERROR detected in t_inFreqType_tfreq_LinearRegression.txt
COMPARE ./parsedAnalysis/t_inFreqType_tfreq_LinearRegression.txt reffiles/t_inFreqType_tfreq_LinearRegression.txt
-rw-r--r--. 1 parson domain users 1726 Nov 26 10:10 t_inFreqType_tfreq_LinearRegression.txt.txt.dif
ERROR detected in t_inFreqType_tfreq_M5PNode2500_T_inFreqType.txt
COMPARE ./parsedAnalysis/t_inFreqType_tfreq_M5PNode2500_T_inFreqType.txt reffiles/t_inFreqType_tfreq_M5PNode2500_T_inFreqType.txt
-rw-r--r--. 1 parson domain users 985 Nov 26 10:10 t_inFreqType_tfreq_M5PNode2500_T_inFreqType.txt.txt.dif
ERROR detected in t_inFreqType_tfreq_M5PNode2500.txt
COMPARE ./parsedAnalysis/t_inFreqType_tfreq_M5PNode2500.txt reffiles/t_inFreqType_tfreq_M5PNode2500.txt
-rw-r--r--. 1 parson domain users 984 Nov 26 10:10 t_inFreqType_tfreq_M5PNode2500.txt.txt.dif
ERROR detected in t_inFreqType_tfreq_M5P.txt
COMPARE ./parsedAnalysis/t_inFreqType_tfreq_M5P.txt reffiles/t_inFreqType_tfreq_M5P.txt
-rw-r--r--. 1 parson domain users 77930 Nov 26 10:10 t_inFreqType_tfreq_M5P.txt.txt.dif
Here is what make test reports after a successful run:
$ make test
TESTS ran in 28.59 secs, 8.00605 processor srcs.
TESTS ran in 28.59 secs, 8.00605 processor srcs.
PYTHONPATH=. /usr/bin/python3.11 diffcsvWgz.py tempDataFiles/CSC558Assn2Dupl.arff /home/kutztown.edu/parson/DataMine/CSC558Assn2Dupl.arff.gz
bash diffParsedFiles.sh
Tests PASS