CSC 523, Scripting for Data Science, Fall 2022, Lecture Notes for Assignment 2 Prep

CSC 523 - Scripting for Data Science, Fall 2022, Monday 6-8:50 PM in Old Main 158.
Lecture notes for Assignment 2 Preparation, 9/19/2022.

On acad or mcgonagall see ~parson/DataMine/CSC523HMfall2022.demo.zip or CSC523HMfall2022/.
We may pull some regressors out of ~parson/DataMine/whitenoise523fall2020.solution.zip or whitenoise523fall2020/.
All of our testing will now be on mcgonagall (ssh mcgonagall; this sometimes hangs, in which case press control-C and retry).

Here is what is in that directory:

CSC523HMfall2022_main.py                      Command-line parsing, model building & testing from input data.
CSC523HMfall2022_generator.py                 Generator that iteratively supplies regressors & data to the above main module.
day_aggregate_HMS_1976_2021_kupapcsit01.arff  Input dataset for this project from Hawk Mountain daily observations Aug-Dec, 1976-2021.
arfflib_3_2.py                                Parson's library for manipulation of Weka ARFF and CSV datasets.
makefile and makelib                          Test drivers as usual.
CSC523HMfall2022Out.txt.ref                   These three reference files hold the expected output for testing.
CSC523HMfall2022Trace.txt.ref
CSC523HMfall2022Out.sorted.txt.ref
fakegen.py                                    A trivial example of a Python generator.
__init__.py                                   Identifies modules in this directory for importing.
diffarff.py                                   Compares ARFF files for "almost equals" on numeric fields. Not yet used.
plotcsv_1_2.py                                Data plotter that renders to the display or to PNG files. Not yet used.
__pycache__                                   Directory for auto-compiled Python bytecode files (.pyc extension).

************************************************************************************************************************************
Start with CSC523HMfall2022_main.py. This will be stable (no STUDENT work) in Assignment 2.

USAGE: python CSC523HMfall2022_main.py MODELPRINTOUT.txt MODULENAME INFILE(.arff|.csv) OUTFILE(.arff|.csv) MODULEARGS...
CSC523HMfall2022Trace.txt is MODELPRINTOUT.txt for this assignment, printout of model details.
MODULENAME is CSC523HMfall2022_generator defined in CSC523HMfall2022_generator.py.
    Function generate(...) is a generator that generates test regressors and data back to __main__.
__main__ verifies command line arguments, imports generate(...), and calls it to initialize the generator.
__main__ then loops over the 9-tuples from the generator, one at a time, to apply a regressor to data via helpAnalyze(...).
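
The argument checking and run-time import described above can be sketched roughly as follows. This is not the code in CSC523HMfall2022_main.py; the helper names parse_args and load_generate and the file name out.csv are hypothetical, and only the argument order is taken from the USAGE line.

```python
import importlib

USAGE = ("python CSC523HMfall2022_main.py MODELPRINTOUT.txt MODULENAME "
         "INFILE(.arff|.csv) OUTFILE(.arff|.csv) MODULEARGS...")

def parse_args(argv):
    """Split argv (without the script name) into named fields.

    Hypothetical helper: raises ValueError when a required argument is
    missing or a data file lacks an .arff or .csv extension.
    """
    if len(argv) < 4:
        raise ValueError("USAGE: " + USAGE)
    trace_file, module_name, infile, outfile = argv[:4]
    for fname in (infile, outfile):
        if not fname.endswith((".arff", ".csv")):
            raise ValueError("expected an .arff or .csv file, got " + fname)
    return {
        "trace": trace_file,
        "module": module_name,
        "infile": infile,
        "outfile": outfile,
        "moduleargs": argv[4:],   # anything left over goes to the generator
    }

def load_generate(module_name, func_name="generate"):
    """Import module_name at run time and return the named function from it."""
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

args = parse_args(["CSC523HMfall2022Trace.txt",
                   "CSC523HMfall2022_generator",
                   "day_aggregate_HMS_1976_2021_kupapcsit01.arff",
                   "out.csv",          # hypothetical output file name
                   "extra1"])          # hypothetical MODULEARGS entry
print(args["module"], args["moduleargs"])
```

In the project directory, load_generate(args["module"]) would return the generate function, which the main loop could then call with args["moduleargs"].
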
helpAnalyze(...) uses regressor.fit(...) to build a model.
helpAnalyze(...) uses regressor.predict(...) to predict test values based on this model.
Calls to wekaCorrelationCoefficent, mean_squared_error, and mean_absolute_error get the standard accuracy / error measures.
    CC normally falls on a scale of [0.0, 1.0] for a useful model (it can range down to -1.0 for inverse correlation), with 1.0 being perfect correlation and possibly a sign of overfitting.
    These slides on Evaluating Numeric Prediction discuss these measures.
The supplied printResults(...) prints results to text files, and prints linear regression formulas and tree structures to CSC523HMfall2022Trace.txt.ref.
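
To make the three measures concrete, here is a small hand-written sketch of the underlying formulas. These functions are stand-ins written for illustration, not the course's wekaCorrelationCoefficent or the sklearn functions, and the data values are made up. The example is chosen so that CC comes out perfect while the error measures do not, since CC tracks how well predictions follow the trend of the actual values, not how close they are.

```python
import math

def pearson_cc(actual, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def mse(actual, predicted):
    """Mean squared error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values: every prediction is exactly 2.0 too high.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [12.0, 14.0, 16.0, 18.0]

cc = pearson_cc(actual, predicted)
print("CC ", cc)                       # 1.0: predictions follow the trend exactly
print("MSE", mse(actual, predicted))   # 4.0
print("MAE", mae(actual, predicted))   # 2.0
```
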

************************************************************************************************************************************
Generator in CSC523HMfall2022_generator.py supplies the test data and regressors. Assignment 1 STUDENT work will go here (not yet).
There is a mountain of import statements from CSC523 Fall 2020, some of which we will use. I will summarize these, e.g.:

In [4]: import arfflib_3_2
In [5]: help(arfflib_3_2.readARFF)
Help on function readARFF in module arfflib_3_2:
readARFF(fname)
    Reads ARFF file named fname and returns (attrmap, dataset), where
    attrmap is the map from attrname -> (offset, type) returned by
    __getAttrIndices__, and dataset is a 2D list indexed on [row][offset]
    that holds actual data instances.
    This offset is attribute position, starting at 0, and type is
    one of a date-3-tuple, 'numeric', 'string', a nominal set in {} delimiters,
    or a ARFF datetime value. A nominal type field is a 3-tuple of
    ('nominal', {NOMINAL_LIST_IN_STRING_FORM}, PYTHON_LIST_OF_NOMINAL_SYMBOLS),
    and a datetime (Weka date) is a 3-tuple consisting of
    ('date', Weka-format-string, Python-datetime-strptime-format-string).
    A nominal attribute-value in the dataset is a simple string as read from an ARFF
    file, and a date attribute-value is a 2-tuple
    (STRING_VALUE, Python datetime.datetime object).
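
As a rough illustration of the (attrmap, dataset) shape described in this help text, the sketch below builds a tiny pair by hand instead of calling readARFF. The offsets and data values are made up, and the helpers column and project are hypothetical, not part of arfflib_3_2.

```python
# Hand-built miniature of the (attrmap, dataset) pair; values are made up.
attrmap = {
    "yearSince1976": (0, "numeric"),   # attrname -> (offset, type)
    "daySinceAug1": (1, "numeric"),
    "HMtempC": (2, "numeric"),
}
dataset = [                            # indexed on [row][offset]
    [0, 0, 21.5],
    [0, 1, 19.0],
    [1, 0, 23.0],
]

def column(attrmap, dataset, attrname):
    """Return one attribute's values, looked up by name via its offset."""
    offset, _attrtype = attrmap[attrname]
    return [row[offset] for row in dataset]

def project(attrmap, dataset, attrnames):
    """Return a new 2D list holding only the named attributes, in order."""
    offsets = [attrmap[name][0] for name in attrnames]
    return [[row[off] for off in offsets] for row in dataset]

temps = column(attrmap, dataset, "HMtempC")
day_and_temp = project(attrmap, dataset, ["daySinceAug1", "HMtempC"])
print(temps)
print(day_and_temp)
```
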

My PART 1 generate(arglist) does the following.

1.1 It reads in the input dataset.
1.2 It builds some sample regressors.
    See the sklearn docs on LinearRegression and DecisionTreeRegressor.
1.3 It extracts attributes yearSince1976, daySinceAug1, and HMtempC for comparison to Weka's analyses of these data (Fig 3-5).
1.4 It iterates over (dataName, regressor) pairs to pass these + data, one pass at a time, back to CSC523HMfall2022_main.py for model building & testing.
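
Steps 1.1 through 1.4 can be pictured with the simplified sketch below. It is not the assignment's generate(...): the real generator yields 9-tuples whose fields are not listed here, so this version yields a hypothetical 4-tuple (dataName, regressor, X, y), uses made-up data in place of the ARFF file, and substitutes a stand-in MeanRegressor class for the sklearn regressors.

```python
class MeanRegressor:
    """Stand-in with a fit/predict interface; always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def generate(arglist):
    """Hypothetical generator yielding (dataName, regressor, X, y) one at a time."""
    # Step 1.1 stand-in: made-up rows of [yearSince1976, daySinceAug1].
    rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
    target = [20.0, 18.0, 22.0, 20.0]          # made-up HMtempC values
    # Step 1.3 stand-in: two views of the non-target attributes.
    views = [
        ("yearAndDay", rows),
        ("dayOnly", [[row[1]] for row in rows]),
    ]
    # Steps 1.2 and 1.4: build a fresh regressor per view and hand it back.
    for dataName, X in views:
        yield (dataName, MeanRegressor(), X, target)

# Consumer loop, playing the role of the main module.
results = []
for dataName, regressor, X, y in generate([]):
    regressor.fit(X, y)                        # build the model
    predictions = regressor.predict(X)         # apply it to the data
    results.append((dataName, predictions[0]))
print(results)
```

Because generate is a generator, each tuple is built only when the loop asks for the next one, so the consumer handles one regressor and dataset at a time.
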

I will write my solution after that; yours will follow. We will discuss Assignment 2 on 9/26.
************************************************************************************************************************************