CSC 458 Assignment 4: Data Cleaning Project in Python. Due via D2L Assignment 4 by 11:59 PM on Friday, April 12.
There is a 10% per-day late penalty, and no points are awarded after I go over my solution in the next class.

Download Assn4Python.problem.zip via this link.
Linux programming veterans can also find Assn4Python.problem.zip on acad in ~parson/DataMine/.
Unzip the zip file, yielding these text files:


NotMyKU.txt                      Input data from Banner obtained via control-A (all), then control-C (copy),
                                            and pasted manually into this text file.

NotMyKUParsed.ref          The reference copy of output text file NotMyKUParsed.txt created by running
                                            CSC458sp24Assn4.py via make test.

NotMyKU.csv.ref               The reference copy of output text file NotMyKU.csv created by running
                                            CSC458sp24Assn4.py via make test. This is a comma-separated-value (CSV)
                                            file that Excel can read.

CSC458sp24Assn4.py       This is the Python script that generates the two output files from the input file,
                                             with the prefix for the input file such as "NotMyKU" specified on its command line.

README.assn4.txt             ALL OF YOUR ANSWERS GO INTO THIS FILE TO BE TURNED IN VIA D2L.
                                            The questions are all about analysis of regular expressions, data formats, and code.
                                            There is no coding.
makefile                              Running make test drives testing. You do not need to do this.
makelib                               A library for the makefile.
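
The file-naming convention in the list above can be sketched as follows. This is only an illustration, not the code in CSC458sp24Assn4.py. The helper name derive_paths is hypothetical, and the output names are inferred from the NotMyKU and Extra file names that appear in this handout.

```python
def derive_paths(prefix):
    """Map a command-line prefix such as "NotMyKU" to the three file names.

    Hypothetical helper: the naming pattern is inferred from the handout's
    file listing, not copied from the assignment script.
    """
    return {
        "input": prefix + ".txt",         # raw text pasted from Banner
        "parsed": prefix + "Parsed.txt",  # parsed text output
        "csv": prefix + ".csv",           # CSV output
    }


# Show the names that the prefix "NotMyKU" would map to.
for role, name in derive_paths("NotMyKU").items():
    print(role, name)
```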

You will turn in README.assn4.txt with your answers via D2L.
If you do bonus step Q11a, turn in your working CSC458sp24Assn4.py, for which make test does not break the diff of NotMyKUParsed.txt.
If you do bonus step Q11b, turn in your Extra.txt file as described below.
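
If make is not available on your machine, a comparison against a reference file can be approximated in Python. The sketch below uses the standard difflib module. The function names diff_text and diff_files are hypothetical, and this is not how the makefile performs its check.

```python
import difflib


def diff_text(reference_text, output_text, ref_name="reference", out_name="output"):
    """Return unified-diff lines; an empty list means the two texts match."""
    return list(difflib.unified_diff(
        reference_text.splitlines(keepends=True),
        output_text.splitlines(keepends=True),
        fromfile=ref_name,
        tofile=out_name,
    ))


def diff_files(reference_path, output_path):
    """Read both files and diff their contents; the caller supplies the paths."""
    # errors="replace" keeps the sketch from failing on unexpected bytes.
    with open(reference_path, encoding="utf-8", errors="replace") as ref_file:
        reference_text = ref_file.read()
    with open(output_path, encoding="utf-8", errors="replace") as out_file:
        output_text = out_file.read()
    return diff_text(reference_text, output_text, reference_path, output_path)
```

For example, an empty list from diff_files("NotMyKUParsed.ref", "NotMyKUParsed.txt") would correspond to a clean diff of the parsed text output.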

TWO OPTIONAL 10-POINT BONUS STEPS. YOU CAN DO NEITHER OR EARN 10 POINTS BY DOING ONE.
YOU CAN DO BOTH BUT YOU STILL ONLY EARN AT MOST 10 BONUS POINTS.

1. Weka chokes reading NotMyKU.csv with the following error message (Bonus STEP Q11a):

[Screenshot: Weka error message when reading NotMyKU.csv]

However, Excel reads it just fine as intended:

[Screenshot: NotMyKU.csv opened in Excel]

These 10 bonus points come if you can enhance CSC458sp24Assn4.py to create a NotMyKU.csv output file that Weka
can read without breaking existing tests. You will get minor diffs on NotMyKU.csv that you should verify with me.
You must not get diffs on NotMyKUParsed.txt.

Add a Q11a to README.assn4.txt explaining your code change and turn in CSC458sp24Assn4.py in addition to README.assn4.txt.
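
A starting point for diagnosing a CSV file that one tool accepts and another rejects is to look for rows that a stricter loader might dislike. The sketch below is a generic diagnostic, not a solution to Q11a. It assumes, without confirmation from the actual Weka message, that candidate causes are rows whose field count differs from the header's or fields containing quote characters. The function name find_suspect_rows is hypothetical.

```python
import csv
import io


def find_suspect_rows(csv_text):
    """Return (row_number, row) pairs that a strict CSV loader might reject.

    Row numbers count non-blank rows, starting at 1 for the header.
    A row is flagged if its width differs from the header's width or if
    any field contains a single or double quote character.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return []
    expected_width = len(rows[0])
    suspects = []
    for row_number, row in enumerate(rows, start=1):
        wrong_width = len(row) != expected_width
        has_quote = any("'" in field or '"' in field for field in row)
        if wrong_width or has_quote:
            suspects.append((row_number, row))
    return suspects
```

To apply it to a file, read the file's text first, for example with open("NotMyKU.csv", newline="") as f: print(find_suspect_rows(f.read())). Compare whatever it flags against the error text Weka actually reports before changing any code.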

BONUS STEP Q11b:

2. Collect and parse data for 2 additional courses from Banner, one with a linked lab section.

Go into new MyKU and search for courses from two different majors, one with a linked lab section similar to GEOL 110
in NotMyKU.txt. Copy them within Banner using control-A, paste them into a text file called Extra.txt, and then run
a Python command like this, with Python 3.x installed on Windows (CMD LINE):

C:\Users\parson>cd Downloads\CSC458sp24Assn4Python

C:\Users\parson\Downloads\CSC458sp24Assn4Python>dir
 Volume in drive C has no label.
 Volume Serial Number is EE27-DF9A

 Directory of C:\Users\parson\Downloads\CSC458sp24Assn4Python

03/29/2024  06:22 PM    <DIR>          .
03/29/2024  06:22 PM    <DIR>          ..
03/29/2024  05:49 PM             3,033 CSC458sp24Assn4.py
03/29/2024  06:22 PM            75,318 Extra.txt
03/29/2024  05:49 PM               871 makefile
03/29/2024  05:49 PM             3,018 makelib
03/29/2024  06:07 PM            44,007 NotMyKU.csv
03/29/2024  05:49 PM            44,007 NotMyKU.csv.ref
03/29/2024  05:49 PM            75,318 NotMyKU.txt
03/29/2024  05:49 PM            45,691 NotMyKUParsed.ref
03/29/2024  05:49 PM             5,346 README.assn4.txt
               9 File(s)        296,609 bytes
               2 Dir(s)  288,453,521,408 bytes free

C:\Users\parson\Downloads\CSC458sp24Assn4Python>python CSC458sp24Assn4.py Extra

C:\Users\parson\Downloads\CSC458sp24Assn4Python>dir
 Volume in drive C has no label.
 Volume Serial Number is EE27-DF9A

 Directory of C:\Users\parson\Downloads\CSC458sp24Assn4Python

03/29/2024  06:23 PM    <DIR>          .
03/29/2024  06:23 PM    <DIR>          ..
03/29/2024  05:49 PM             3,033 CSC458sp24Assn4.py
03/29/2024  06:23 PM            44,007 Extra.csv
03/29/2024  06:22 PM            75,318 Extra.txt
03/29/2024  06:23 PM            45,691 ExtraParsed.txt
03/29/2024  05:49 PM               871 makefile
03/29/2024  05:49 PM             3,018 makelib
03/29/2024  06:07 PM            44,007 NotMyKU.csv
03/29/2024  05:49 PM            44,007 NotMyKU.csv.ref
03/29/2024  05:49 PM            75,318 NotMyKU.txt
03/29/2024  05:49 PM            45,691 NotMyKUParsed.ref
03/29/2024  05:49 PM             5,346 README.assn4.txt
              11 File(s)        386,307 bytes
               2 Dir(s)  288,453,447,680 bytes free

On UNIX here are the command lines:

$ cd /cygdrive/c/Users/parson/Downloads/CSC458sp24Assn4Python
parson@DESKTOP-14DQ9MR /cygdrive/c/Users/parson/Downloads/CSC458sp24Assn4Python

$ ls
CSC458sp24Assn4.py  NotMyKU.csv.ref    README.assn4.txt
Extra.txt           NotMyKU.txt        makefile
NotMyKU.csv         NotMyKUParsed.ref  makelib

$ python CSC458sp24Assn4.py Extra

$ ls
CSC458sp24Assn4.py  ExtraParsed.txt  NotMyKU.txt        makefile
Extra.csv           NotMyKU.csv      NotMyKUParsed.ref  makelib
Extra.txt           NotMyKU.csv.ref  README.assn4.txt

Examine the contents of output file ExtraParsed.txt and determine whether
every course record in Extra.txt appears correctly in ExtraParsed.txt.
If a course record is missing or malformed in ExtraParsed.txt, use Pythex
to try to find an enhancement to our regular expression that will parse
the failed course data. You are not obligated to solve the mismatch
problem, but keep it in Extra.txt so I can see it and describe it in Q11b.
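
A rough first pass at the comparison described above can be automated before you inspect records by hand. The sketch below assumes course codes look like "GEOL 110" (two to four capital letters, a space, three digits) and that they appear in the same form in both files. Both are assumptions about the data, and the pattern COURSE_CODE is a hypothetical heuristic, not the regular expression from the handout.

```python
import re

# Hypothetical heuristic for codes such as "GEOL 110"; adjust to the real data.
COURSE_CODE = re.compile(r"\b[A-Z]{2,4} \d{3}\b")


def missing_courses(raw_text, parsed_text):
    """Return course codes found in the raw text but absent from the parsed text."""
    raw_codes = set(COURSE_CODE.findall(raw_text))
    parsed_codes = set(COURSE_CODE.findall(parsed_text))
    return sorted(raw_codes - parsed_codes)
```

A code reported as missing is only a hint to look at that record more closely. A code that appears in both files can still belong to a malformed record, so the manual check remains necessary.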

Add a Q11b to README.assn4.txt explaining what you found.
If you find a regular expression that works with a failed course,
give the regular expression and describe its change from the handout.
You might not find problematic data. That's OK.

Turn in Extra.txt in addition to README.assn4.txt.