CSC 523 - Scripting for Data Science, Fall 2022, Monday 6-8:50 PM in Old Main 158.
Lecture notes for Assignment 2 Preparation, 9/19/2022.
On acad or mcgonagall see ~parson/DataMine/CSC523HMfall2022.demo.zip
or CSC523HMfall2022/.
We may pull some regressors out of ~parson/DataMine/whitenoise523fall2020.solution.zip
or whitenoise523fall2020/.
All of our testing will now be on mcgonagall (ssh mcgonagall; this
sometimes hangs, requiring control-C and a retry).
Here is what is in that directory:
CSC523HMfall2022_main.py
Command-line parsing, model building & testing from input data.
CSC523HMfall2022_generator.py
Generator that iteratively supplies regressor & data to the above main module.
day_aggregate_HMS_1976_2021_kupapcsit01.arff
Input dataset for this project from Hawk Mountain daily observations Aug-Dec, 1976-2021.
arfflib_3_2.py
Parson's library for manipulation of Weka ARFF and CSV datasets.
makefile and makelib
Test drivers as usual.
CSC523HMfall2022Out.txt.ref
CSC523HMfall2022Trace.txt.ref
CSC523HMfall2022Out.sorted.txt.ref
The reference files hold expected output for testing.
fakegen.py
A trivial example of a Python generator.
__init__.py
Identifies modules in this directory for importing.
diffarff.py
Compares ARFF files for "almost equals" on numeric fields. Not yet used.
plotcsv_1_2.py
Data plotter to display or PNG files. Not yet used.
__pycache__
Directory for auto-compiled Python bytecode files (.pyc extension).
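If generators are new to you, here is a minimal sketch of the idea. This is not the contents of fakegen.py; the function name countUp is made up for illustration.

```python
def countUp(limit):
    """Yield the integers 0 .. limit-1, one per request."""
    value = 0
    while value < limit:
        yield value      # pause here until the caller asks for the next value
        value += 1

gen = countUp(3)         # calling the function only creates the generator object
first = next(gen)        # runs the body up to the first yield
rest = list(gen)         # drains whatever values remain
```

The body does not run until the caller asks for a value, which is what lets a generator hand work to its caller one item at a time.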
************************************************************************************************************************************
Start with CSC523HMfall2022_main.py. This will be stable
(no STUDENT work) in Assignment 2.
USAGE: python CSC523HMfall2022_main.py MODELPRINTOUT.txt MODULENAME INFILE(.arff|.csv) OUTFILE(.arff|.csv) MODULEARGS...
CSC523HMfall2022Trace.txt serves as MODELPRINTOUT.txt for this
assignment; it receives the printout of model details.
MODULENAME is CSC523HMfall2022_generator
defined in CSC523HMfall2022_generator.py.
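A rough sketch of how the positional arguments in the USAGE line could be checked. This is an assumption about the general approach, not the code in CSC523HMfall2022_main.py; parseCommandLine is a hypothetical name, the sketch assumes MODULEARGS may be empty, and out.csv is a made-up output file name.

```python
USAGE = ("USAGE: python CSC523HMfall2022_main.py MODELPRINTOUT.txt MODULENAME "
         "INFILE(.arff|.csv) OUTFILE(.arff|.csv) MODULEARGS...")

def parseCommandLine(argv):
    """Split argv into the positional fields named in the USAGE line."""
    if len(argv) < 5:                      # script name + 4 required fields
        raise ValueError(USAGE)
    modelPrintout, moduleName, infile, outfile = argv[1:5]
    for fname in (infile, outfile):
        if not fname.endswith(('.arff', '.csv')):
            raise ValueError(USAGE)
    moduleArgs = argv[5:]                  # passed through to the generator module
    return modelPrintout, moduleName, infile, outfile, moduleArgs

# Hypothetical command line; 'out.csv' is a made-up output file name.
parsed = parseCommandLine([
    'CSC523HMfall2022_main.py', 'CSC523HMfall2022Trace.txt',
    'CSC523HMfall2022_generator',
    'day_aggregate_HMS_1976_2021_kupapcsit01.arff', 'out.csv'])
```

A real script would pass sys.argv instead of a hand-built list.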
Function generate(...)
is a generator that yields test regressors and data
back to __main__.
__main__ verifies command line arguments,
imports generate(...), and calls it to initialize the generator.
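Since MODULENAME arrives as a command-line string, the import has to happen by name at run time. One common way to do that is importlib; whether the main module does exactly this is an assumption, and loadFunction is a hypothetical helper name. The demo line uses the standard math module so the sketch runs anywhere.

```python
import importlib

def loadFunction(moduleName, functionName='generate'):
    """Import moduleName at run time and return the named function from it."""
    module = importlib.import_module(moduleName)
    function = getattr(module, functionName)
    if not callable(function):
        raise TypeError(functionName + ' in ' + moduleName + ' is not callable')
    return function

# Demo with a standard-library module instead of the course generator module.
sqrtFunction = loadFunction('math', 'sqrt')
```

On mcgonagall the analogous call would be loadFunction('CSC523HMfall2022_generator'), run from the project directory.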
__main__ then loops over the 9-tuples from
the generator, one at a time, to apply a regressor to data via
helpAnalyze(...).
helpAnalyze(...) uses regressor.fit(...) to
build a model.
helpAnalyze(...) uses
regressor.predict(...) to predict test values based on this
model.
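The loop-then-fit-then-predict flow can be sketched as follows. Everything here is a stand-in: the real generator yields 9-tuples whose fields are not listed in these notes, so this sketch uses a made-up 5-tuple, a toy MeanRegressor instead of an sklearn regressor, and helpAnalyzeSketch instead of the real helpAnalyze(...).

```python
class MeanRegressor:
    """Toy stand-in with sklearn-style fit/predict; predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_ for _ in X]

def fakeGenerate():
    """Hypothetical generator; the tuple layout is invented for this sketch."""
    trainX, trainY = [[0.0], [1.0], [2.0]], [10.0, 20.0, 30.0]
    testX = [[3.0], [4.0]]
    yield ('toyData', MeanRegressor(), trainX, trainY, testX)

def helpAnalyzeSketch(regressor, trainX, trainY, testX):
    regressor.fit(trainX, trainY)          # build the model from training data
    return regressor.predict(testX)        # apply the model to the test rows

results = {}
for dataName, regressor, trainX, trainY, testX in fakeGenerate():
    results[dataName] = helpAnalyzeSketch(regressor, trainX, trainY, testX)
```

Because the loop only relies on fit(...) and predict(...), any regressor that follows that calling convention can be swapped in without touching the loop.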
Calls to wekaCorrelationCoefficent, mean_squared_error,
and mean_absolute_error get the standard accuracy /
error measures.
CC (the correlation coefficient) can range over [-1.0, 1.0]; a useful
model lands in [0.0, 1.0], with 1.0 being perfect correlation and possibly overfit.
These
slides on Evaluating Numeric Prediction discuss
these measures.
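To make the three measures concrete, here is a plain-Python sketch of the textbook formulas. These are stand-ins written for illustration, not the course's wekaCorrelationCoefficent or sklearn's functions; the names pearsonCC, meanSquaredError, and meanAbsoluteError and the toy numbers are made up.

```python
import math

def pearsonCC(actual, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(actual)
    meanA = sum(actual) / n
    meanP = sum(predicted) / n
    cov = sum((a - meanA) * (p - meanP) for a, p in zip(actual, predicted))
    varA = sum((a - meanA) ** 2 for a in actual)
    varP = sum((p - meanP) ** 2 for p in predicted)
    return cov / math.sqrt(varA * varP)

def meanSquaredError(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def meanAbsoluteError(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]           # every prediction is 0.5 too high
cc = pearsonCC(actual, predicted)
mse = meanSquaredError(actual, predicted)
mae = meanAbsoluteError(actual, predicted)
```

In this toy case the predictions track the actual values perfectly in shape, so CC comes out at 1.0 even though every prediction is off by 0.5. That is why CC is read together with the error measures.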
The supplied printResults(...) prints results to text files, and
prints linear regression formulas and tree structures
to CSC523HMfall2022Trace.txt.ref.
************************************************************************************************************************************
Generator in CSC523HMfall2022_generator.py supplies
the test data and regressors. Assignment 1 STUDENT work
will go here (not yet).
There is a mountain of import statements from CSC523 Fall
2020, some of which we will use. I will summarize these, e.g.:
In [4]: import arfflib_3_2
In [5]: help(arfflib_3_2.readARFF)
Help on function readARFF in module arfflib_3_2:
readARFF(fname)
    Reads ARFF file named fname and returns (attrmap, dataset), where
    attrmap is the map from attrname -> (offset, type) returned by
    __getAttrIndices__, and dataset is a 2D list indexed on [row][offset]
    that holds actual data instances.
    This offset is attribute position, starting at 0, and type is
    one of a date-3-tuple, 'numeric', 'string', a nominal set in {} delimiters,
    or a ARFF datetime value. A nominal type field is a 3-tuple of
    ('nominal', {NOMINAL_LIST_IN_STRING_FORM}, PYTHON_LIST_OF_NOMINAL_SYMBOLS),
    and a datetime (Weka date) is a 3-tuple consisting of
    ('date', Weka-format-string, Python-datetime-strptime-format-string).
    A nominal attribute-value in the dataset is a simple string as read from an ARFF
    file, and a date attribute-value is a 2-tuple
    (STRING_VALUE, Python datetime.datetime object).
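Here is a small sketch of working with the (attrmap, dataset) pair that the help text describes. The structures are built by hand so the sketch runs without arfflib_3_2. The data values, the attribute name toyNominal, and the exact '{N,S}' string form are made up, and column is a hypothetical helper.

```python
# Hand-built stand-ins shaped like the documented return value of readARFF.
attrmap = {
    'daySinceAug1': (0, 'numeric'),
    'HMtempC': (1, 'numeric'),
    'toyNominal': (2, ('nominal', '{N,S}', ['N', 'S'])),   # made-up attribute
}
dataset = [
    [0, 21.5, 'N'],      # one row per instance, indexed [row][offset]
    [1, 19.0, 'S'],
]

def column(attrmap, dataset, attrname):
    """Return every value of one attribute, using its offset from attrmap."""
    offset, attrtype = attrmap[attrname]
    return [row[offset] for row in dataset]

temps = column(attrmap, dataset, 'HMtempC')

# Nominal attributes are the ones whose type field is a tuple tagged 'nominal'.
nominalNames = [name for name, (offset, attrtype) in attrmap.items()
                if isinstance(attrtype, tuple) and attrtype[0] == 'nominal']
```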
My PART 1 generate(arglist) does the following.
1.1 It reads in the input dataset.
1.2 It builds some sample regressors.
See the sklearn docs on LinearRegression
and DecisionTreeRegressor.
1.3 It extracts attributes yearSince1976,
daySinceAug1, and HMtempC for comparison to Weka's
analyses of these data (Fig 3-5).
1.4 It iterates over (dataName, regressor) pairs
to pass these + data, one pass at a time, back to
CSC523HMfall2022_main.py for model building & testing.
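Steps 1.3 and 1.4 can be sketched roughly like this. It is an illustration under assumptions, not my generate(arglist): generateSketch, the 4-tuple it yields, the attribute name otherAttr, the placeholder regressor objects, and the data values are all made up.

```python
def generateSketch(attrmap, dataset, regressors):
    """Extract three attributes, then yield one (data, regressor) combo per pass."""
    wanted = ['yearSince1976', 'daySinceAug1', 'HMtempC']
    offsets = [attrmap[name][0] for name in wanted]         # step 1.3
    subset = [[row[i] for i in offsets] for row in dataset]
    namedData = [('threeAttributes', subset)]
    for dataName, data in namedData:                        # step 1.4
        for regressorName, regressor in regressors:
            yield (dataName, regressorName, regressor, data)

# Toy inputs; 'otherAttr' is a made-up column that the extraction skips.
toyAttrmap = {'yearSince1976': (0, 'numeric'), 'otherAttr': (1, 'numeric'),
              'daySinceAug1': (2, 'numeric'), 'HMtempC': (3, 'numeric')}
toyDataset = [[0, 99, 0, 21.5], [0, 98, 1, 19.0]]
toyRegressors = [('placeholderA', object()), ('placeholderB', object())]

passes = list(generateSketch(toyAttrmap, toyDataset, toyRegressors))
```

In the real module the placeholder objects would be sklearn regressors such as LinearRegression() and DecisionTreeRegressor().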
I will write my solution after that; yours will follow. We will discuss
Assignment 2 on 9/26.
************************************************************************************************************************************