CSC 458 Assignment 4: Data Cleaning Project in Python
Due via D2L Assignment 4 by 11:59 PM on Friday, April 12.
There is a 10% per-day late penalty, and no points are awarded after I go over my solution in the next class.
Download Assn4Python.problem.zip via this link.
Linux programming veterans can also find Assn4Python.problem.zip on acad in ~parson/DataMine/.
Unzip the zip file, yielding these text files:
NotMyKU.txt
    Input data from Banner obtained via control-A (all), then control-C (copy), and pasted manually into this text file.
NotMyKUParsed.ref
    The reference copy of output text file NotMyKUParsed.txt created by running CSC458sp24Assn4.py via make test.
NotMyKU.csv.ref
    The reference copy of output text file NotMyKU.csv created by running CSC458sp24Assn4.py via make test. This is a comma-separated-value (CSV) file that Excel can read.
CSC458sp24Assn4.py
    This is the Python script that generates the two output files from the input file, with the prefix for the input file, such as "NotMyKU", specified on its command line.
README.assn4.txt
    ALL OF YOUR ANSWERS GO INTO THIS FILE TO BE TURNED IN VIA D2L. The questions are all about analysis of regular expressions, data formats, and code. There is no coding.
makefile
    Running make test drives testing. You do not need to do this.
makelib
    A library for the makefile.
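For anyone who wants to compare an output file with its reference copy without running make, here is a minimal sketch in Python. It is only an approximation of that kind of check: the makefile's actual commands are not shown in this handout, and the helper name diff_against_reference is hypothetical.

```python
import difflib
from pathlib import Path


def diff_against_reference(output_path, reference_path):
    """Return unified-diff lines between an output file and its reference copy.

    Hypothetical helper: an empty list means the two files match line for line.
    """
    out_lines = Path(output_path).read_text(
        encoding="utf-8", errors="replace").splitlines(keepends=True)
    ref_lines = Path(reference_path).read_text(
        encoding="utf-8", errors="replace").splitlines(keepends=True)
    return list(difflib.unified_diff(
        ref_lines, out_lines,
        fromfile=str(reference_path), tofile=str(output_path)))


if __name__ == "__main__":
    # File names taken from the list above; pairs that are absent are skipped.
    pairs = [("NotMyKUParsed.txt", "NotMyKUParsed.ref"),
             ("NotMyKU.csv", "NotMyKU.csv.ref")]
    for out_name, ref_name in pairs:
        if not (Path(out_name).exists() and Path(ref_name).exists()):
            print(out_name + ": skipped (file not found)")
            continue
        delta = diff_against_reference(out_name, ref_name)
        print(out_name + ": " + ("no diffs" if not delta else "diffs found"))
```

An empty result for NotMyKUParsed.txt is what the bonus step Q11a requirement of "no diffs" would look like under this sketch.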
You will turn in README.assn4.txt with your answers via D2L.
If you do bonus step Q11a, also turn in your working CSC458sp24Assn4.py, for which make test must show no diffs on NotMyKUParsed.txt.
If you do bonus step Q11b, also turn in your Extra.txt file as described below.
TWO OPTIONAL 10-POINT BONUS STEPS. YOU CAN DO NEITHER OR EARN 10 POINTS BY DOING ONE. YOU CAN DO BOTH BUT YOU STILL ONLY EARN AT MOST 10 BONUS POINTS.
1. BONUS STEP Q11a: Weka chokes reading NotMyKU.csv with the following error message:
[Screenshot: Weka error message]
However, Excel reads it just fine as intended:
[Screenshot: NotMyKU.csv opened in Excel]
These 10 bonus points come if you can enhance CSC458sp24Assn4.py to create a NotMyKU.csv output file that Weka can read without breaking existing tests. You will get minor diffs on NotMyKU.csv that you should verify with me. You must not get diffs on NotMyKUParsed.txt.
Add a Q11a to README.assn4.txt explaining your code change, and turn in CSC458sp24Assn4.py in addition to README.assn4.txt.
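One possible starting point for investigating Q11a, without assuming what actually causes the Weka error, is to look at the CSV's structure the way Python's csv module sees it. The sketch below counts fields per row and flags rows whose fields contain commas or quote characters, since different CSV readers can treat those differently. The function names are hypothetical, and nothing here is taken from CSC458sp24Assn4.py.

```python
import csv
from collections import Counter
from pathlib import Path


def field_count_histogram(csv_path):
    """Map 'number of fields in a row' to 'how many rows have that count'."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        return Counter(len(row) for row in csv.reader(handle))


def rows_with_special_characters(csv_path):
    """Return 1-based row numbers where some field holds a comma or a quote."""
    flagged = []
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if any(ch in field for field in row for ch in (",", '"', "'")):
                flagged.append(row_number)
    return flagged


if __name__ == "__main__":
    target = "NotMyKU.csv"  # output file named in this handout
    if Path(target).exists():
        print("fields per row:", dict(field_count_histogram(target)))
        print("rows worth a closer look:", rows_with_special_characters(target)[:20])
    else:
        print(target + " not found in the current directory")
```

If every row has the same field count and no rows are flagged, the cause lies elsewhere and this check has simply ruled out one possibility.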
2. BONUS STEP Q11b: Collect and parse data for 2 additional courses from Banner, one with a linked lab section.
Go into new MyKU and search for courses from two different majors, one with a linked lab section similar to GEOL 110 in NotMyKU.txt. Copy them within Banner using control-A (then control-C, as for NotMyKU.txt), paste them into a text file called Extra.txt, and then run a Python command like this, with Python 3.x installed on Windows (CMD LINE):
C:\Users\parson>cd Downloads\CSC458sp24Assn4Python
C:\Users\parson\Downloads\CSC458sp24Assn4Python>dir
 Volume in drive C has no label.
 Volume Serial Number is EE27-DF9A
 Directory of C:\Users\parson\Downloads\CSC458sp24Assn4Python
03/29/2024  06:22 PM    <DIR>          .
03/29/2024  06:22 PM    <DIR>          ..
03/29/2024  05:49 PM             3,033 CSC458sp24Assn4.py
03/29/2024  06:22 PM            75,318 Extra.txt
03/29/2024  05:49 PM               871 makefile
03/29/2024  05:49 PM             3,018 makelib
03/29/2024  06:07 PM            44,007 NotMyKU.csv
03/29/2024  05:49 PM            44,007 NotMyKU.csv.ref
03/29/2024  05:49 PM            75,318 NotMyKU.txt
03/29/2024  05:49 PM            45,691 NotMyKUParsed.ref
03/29/2024  05:49 PM             5,346 README.assn4.txt
               9 File(s)        296,609 bytes
               2 Dir(s)  288,453,521,408 bytes free
C:\Users\parson\Downloads\CSC458sp24Assn4Python>python CSC458sp24Assn4.py Extra
C:\Users\parson\Downloads\CSC458sp24Assn4Python>dir
 Volume in drive C has no label.
 Volume Serial Number is EE27-DF9A
 Directory of C:\Users\parson\Downloads\CSC458sp24Assn4Python
03/29/2024  06:23 PM    <DIR>          .
03/29/2024  06:23 PM    <DIR>          ..
03/29/2024  05:49 PM             3,033 CSC458sp24Assn4.py
03/29/2024  06:23 PM            44,007 Extra.csv
03/29/2024  06:22 PM            75,318 Extra.txt
03/29/2024  06:23 PM            45,691 ExtraParsed.txt
03/29/2024  05:49 PM               871 makefile
03/29/2024  05:49 PM             3,018 makelib
03/29/2024  06:07 PM            44,007 NotMyKU.csv
03/29/2024  05:49 PM            44,007 NotMyKU.csv.ref
03/29/2024  05:49 PM            75,318 NotMyKU.txt
03/29/2024  05:49 PM            45,691 NotMyKUParsed.ref
03/29/2024  05:49 PM             5,346 README.assn4.txt
              11 File(s)        386,307 bytes
               2 Dir(s)  288,453,447,680 bytes free
On UNIX, here are the command lines:
$ cd /cygdrive/c/Users/parson/Downloads/CSC458sp24Assn4Python
parson@DESKTOP-14DQ9MR /cygdrive/c/Users/parson/Downloads/CSC458sp24Assn4Python
$ ls
CSC458sp24Assn4.py  NotMyKU.csv.ref    README.assn4.txt
Extra.txt           NotMyKU.txt        makefile
NotMyKU.csv         NotMyKUParsed.ref  makelib
$ python CSC458sp24Assn4.py Extra
$ ls
CSC458sp24Assn4.py  ExtraParsed.txt  NotMyKU.txt        makefile
Extra.csv           NotMyKU.csv      NotMyKUParsed.ref  makelib
Extra.txt           NotMyKU.csv.ref  README.assn4.txt
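The listings above suggest a naming convention: a prefix such as Extra on the command line reads Extra.txt and writes ExtraParsed.txt and Extra.csv. The short sketch below spells that convention out and checks which of the three files exist after a run. The function name expected_files is hypothetical, and the convention is inferred from the listings rather than from the script's source.

```python
from pathlib import Path


def expected_files(prefix, directory="."):
    """Return the input and output paths implied by a command-line prefix.

    Inferred from the directory listings: PREFIX.txt is the input, and
    PREFIXParsed.txt and PREFIX.csv are the two outputs.
    """
    base = Path(directory)
    return {
        "input": base / (prefix + ".txt"),
        "parsed": base / (prefix + "Parsed.txt"),
        "csv": base / (prefix + ".csv"),
    }


if __name__ == "__main__":
    # Report which of the files for the Extra prefix are present.
    for role, path in expected_files("Extra").items():
        status = "present" if path.exists() else "missing"
        print(role + ": " + path.name + " (" + status + ")")
```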
Examine the contents of output file ExtraParsed.txt and determine whether every course record in Extra.txt appears correctly in ExtraParsed.txt.
If a course record is missing or malformed in ExtraParsed.txt, use Pythex to try to find an enhancement to our regular expression that will parse the failed course data. You are not obligated to solve the mismatch problem, but keep the problematic record in Extra.txt so I can see it, and describe it in Q11b.
Add a Q11b to README.assn4.txt explaining what you found.
If you find a regular expression that works with a failed course, give the regular expression and describe its change from the handout.
You might not find problematic data. That's OK.
Turn in Extra.txt in addition to README.assn4.txt.
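As a rough first pass before reading ExtraParsed.txt by eye, a sketch like the following can list course codes that show up in Extra.txt but not in ExtraParsed.txt. The pattern is an assumption for illustration only: it matches codes shaped like GEOL 110 (2 to 4 capital letters, whitespace, 3 digits), it is not the regular expression from the handout, and it may produce false positives or miss records if Banner formats them differently.

```python
import re
from pathlib import Path

# Hypothetical pattern for codes such as "GEOL 110"; not the handout's regex.
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s+(\d{3})\b")


def course_codes(text):
    """Return the set of 'SUBJ NNN' course codes found in a block of text."""
    return {subject + " " + number for subject, number in COURSE_CODE.findall(text)}


def missing_codes(raw_text, parsed_text):
    """Return sorted course codes present in raw_text but absent from parsed_text."""
    return sorted(course_codes(raw_text) - course_codes(parsed_text))


if __name__ == "__main__":
    raw_path, parsed_path = Path("Extra.txt"), Path("ExtraParsed.txt")
    if raw_path.exists() and parsed_path.exists():
        raw = raw_path.read_text(encoding="utf-8", errors="replace")
        parsed = parsed_path.read_text(encoding="utf-8", errors="replace")
        print("codes not found in the parsed output:", missing_codes(raw, parsed))
    else:
        print("Extra.txt and ExtraParsed.txt are both needed for this check")
```

A code reported as missing is only a hint to inspect that record by hand; a code that appears in both files can still have malformed fields, so this does not replace reading the parsed records.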