CSC 523 - Scripting for Data Science, Fall 2020, Tu 6-8:50 PM in Old Main 158.
Dr. Dale E. Parson
Class will be live on-line at class time via Zoom. Please read student instructions here.
Mon 6-8:50 PM, Zoom classes & recordings, http://faculty.kutztown.edu/parson
Class-time Zoom link for CSC523: See D2L Course CSC523 ->
Content -> Overview for the link.
TO WATCH RECORDINGS AFTER 11/6, go here https://kutztown.zoom.us/
and Sign In using KU login.
IF you don’t want to be recorded or are a minor,
use PRIVATE ZOOM CHAT to me for questions.
Please fill out & email Dr. Parson this permission to record slip.
I will use it to take attendance in week 1.
Normal Office Hours 11/30-12/3: Monday 1-2, Tuesday 3:30-4:30,
Wednesday 12-2, Thursday 3:30-4:30 or by appt.
Final week office hours 12/7-12/10: Monday 1-3, Tuesday
& Thursday 11-12:30
See your course page for final exam work-session schedule.
KU Campus Mask policy: Resident students must wear a mask
anytime they are outside of their personal room and within a
building or with anyone else but their roommate. Commuter
students must wear a mask anytime they are on campus within a
building or with anyone. The course is 100% via Zoom at class
time. I will record & post class videos, but want you there
at class time.
PA: The Secretary's Order requires individuals to wear a
face covering, in both indoor public places and in the outdoors
when they are not able to consistently maintain social
distancing from individuals who are not members of their
household, such as on a busy sidewalk, waiting in line to enter
a place, or near others at any place people are congregating.
Whether inside in a public place or outside, and when wearing a
face covering or not, everyone should socially distance at least
6 feet apart from others who are not part of your household.
Dr. Dale E. Parson, parson@kutztown.edu, Office hours: https://kutztown.zoom.us/j/94322223872
Office Hours Monday 1-2, Tuesday 3:30-4:30, Wednesday 12-2,
Thursday 3:30-4:30 or by appt.
KU offers a 4-course Graduate Certificate in Data Analytics. Talk with me if you want to sign up.
Deloitte has been recruiting Data Scientists (including graduate level) and Data Analysts (undergrad only) in Mechanicsburg, PA.
First day handout
(syllabus that is specific to this semester).
I commit to using each student's preferred name and preferred
gender pronoun. Feel free to contact me in private if I
make mistakes in pronunciation, name, gender, or anything
else. Thanks! Here is a poll to which you can reply
privately on paper or via email.
Gender-Based Crimes
Educators must report incidents of gender-based crimes,
including sexual assault, sexual harassment, stalking, dating
violence, and domestic violence. If a student discloses
such incidents to me during class or in a course assignment, I
am not required to report the disclosure, unless the student
was a minor at the time the incident occurred.
Regardless of the student’s age, if the incident is disclosed
to me outside the classroom setting or a course assignment, I
am required by law to report the disclosure, including
relevant details, such as the names of those involved in the
incident, to Public Safety and Police Services and to Mr.
Jesus Peña, Title IX Coordinator.
Jesus A. Peña, Esq.
Deputy to the President for Compliance, Equity & Legal
Affairs
(610) 683-4700
pena@kutztown.edu
There is a 10% per day late penalty for projects that come in
after the due date.
RESOURCES & HANDOUTS.
Link to the Fall
2019 CSC458 course.
Here
is the Anaconda site from which you can download MOST
of the software tools we will use this semester.
You can also do all of your development
on acad. You will have to turn solutions in as source .py
files on acad.
Windows users can download the WinSCP file transfer
client in the Computer Science
sub-menu below here.
I have read reports of
adware being bundled with the FileZilla installer. I have
used FileZilla for years with no problem.
We will be using Python
3.x. I will use IPython
in lecture. You can use any interactive Python environment you
like.
You will turn in projects as stand-alone
PROJECT.py scripts, with tests driven by my makefiles or my
Python scripts.
How
to Think Like a Computer Scientist looks like a good
tutorial for Python newbies.
Dr.
Schwesinger has posted some additional Python materials
for CSC223.
Python
regular expressions; a Python regular expression test harness.
Here is a fall 2017 assignment 1
for CSC458 that serves as an intro to Python module re.
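As a minimal sketch of the kind of data extraction that module re supports, the following pulls named fields out of one made-up text record. The record format, the field names, and the helper parse_line are illustrative assumptions, not the data format or code from any assignment.

```python
import re

# Hypothetical record format, invented for illustration only.
LINE = "12:01:07.250 IP 10.0.0.5.51234 > 10.0.0.9.80: tcp 512"

# Named groups make the extracted fields self-documenting.
PATTERN = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+IP\s+"
    r"(?P<src>\d+(?:\.\d+){3})\.(?P<sport>\d+)\s+>\s+"
    r"(?P<dst>\d+(?:\.\d+){3})\.(?P<dport>\d+):\s+"
    r"(?P<proto>\w+)\s+(?P<length>\d+)"
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to ints.
    for key in ("sport", "dport", "length"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    print(parse_line(LINE))
```

Compiling the pattern once and returning None on a failed match keeps the caller's loop over many lines simple: skip the None results and keep the dicts.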
Scikit-learn
will be the primary library for several of our projects.
We may need to install libraries from SciPy.org or Anaconda.
Each project will outline its library requirement.
Using Notepad++: Go to Settings->Preferences...->Language (since version 7.1) or Settings->Preferences...->Tab Settings (previous versions) and check "Replace by space".
To convert existing tabs to spaces, use Edit->Blank Operations->TAB to Space.
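If you are a vim editor user, create a file called .vimrc in your login directory with the following lines:
set ai          " autoindent: a new line inherits the current indentation
set ts=4        " tabstop: a tab character displays as 4 columns
set sw=4        " shiftwidth: each indent level is 4 columns
set expandtab   " insert spaces instead of tab characters
set sta         " smarttab: Tab at the start of a line indents by shiftwidth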
ASSIGNMENTS
There is a 10% per day late penalty for projects that come in
after the due date.
Assignment 1 on Python regular expressions for data extraction
due 11:59 PM on September 24 (changed from 17th) via make
turnitin.
Here are my 9/21 class notes on Assignment
1 parts 6 & 8, updated to remove bugs on 9/22.
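The September 3 office hours (see the archive) covered using Python dictionaries to aggregate full TCP streams for Part 6. The sketch below shows one way that idea can look. The record layout, the sample data, and the helpers stream_key and aggregate are hypothetical, and treating both directions of a connection as one stream is an assumption, not necessarily what the assignment requires.

```python
from collections import defaultdict

# Hypothetical pre-parsed packet records: (src, sport, dst, dport, nbytes).
PACKETS = [
    ("10.0.0.5", 51234, "10.0.0.9", 80, 100),
    ("10.0.0.9", 80, "10.0.0.5", 51234, 1500),
    ("10.0.0.5", 51234, "10.0.0.9", 80, 60),
    ("10.0.0.7", 40000, "10.0.0.9", 22, 48),
]


def stream_key(src, sport, dst, dport):
    """Order the two endpoints so both directions map to the same key."""
    return tuple(sorted([(src, sport), (dst, dport)]))


def aggregate(packets):
    """Return {stream key: {'packets': count, 'bytes': total}}."""
    streams = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, sport, dst, dport, nbytes in packets:
        entry = streams[stream_key(src, sport, dst, dport)]
        entry["packets"] += 1
        entry["bytes"] += nbytes
    return dict(streams)


if __name__ == "__main__":
    for key, totals in aggregate(PACKETS).items():
        print(key, totals)
```

The dictionary key does the real work here: any hashable tuple that identifies a stream can serve as the key, and the values accumulate whatever per-stream totals are needed.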
Assignment 2 is a Python data
preparation & classification problem from a csc558
project in Weka, due 11:59 PM October 22.
It derives from CSC558
Assignment 1 last semester. Here is the audio
signal data overview.
My example Python code for a
different, example csc458 problem is on acad at
~parson/DataMine/CSC523Example2.zip.
Here is an ipython
interactive trace that I used in working on this example
code. Here is another for
9/21 on sklearn.
A graph
on information entropy, which relates to building rules &
decision trees.
Here is my
page on interpreting the Kappa statistic needed for
Assignment 2.
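For reference, here is a small sketch of how Cohen's kappa can be computed from a confusion matrix: kappa = (observed agreement - chance agreement) / (1 - chance agreement). The function name and the example matrix are made up for illustration, and this is not the code from the assignment handout.

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    Assumes chance agreement is below 1 (more than one class appears).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Fraction of instances on the diagonal (correct predictions).
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    example = [[40, 10],
               [5, 45]]
    print(kappa(example))
```

For the example matrix, observed agreement is 0.85 and chance agreement is 0.5, so kappa works out to 0.7. A classifier that is correct only as often as chance would predict scores 0.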
Here is a page
comparing information entropy to gini index in deciding
when/how to branch decision trees.
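To make the comparison concrete, this sketch computes both impurity measures from class counts and scores a candidate split by the weighted impurity of its partitions. The function names and the counts are illustrative assumptions. (In scikit-learn, DecisionTreeClassifier selects between the two with criterion="gini" or criterion="entropy".)

```python
from math import log2


def entropy(counts):
    """Information entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity(partitions, measure):
    """Impurity after a split; each non-empty partition is weighted by size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * measure(p) for p in partitions)


if __name__ == "__main__":
    parent = [5, 5]              # hypothetical class counts before the split
    split = [[4, 1], [1, 4]]     # hypothetical class counts in each branch
    for measure in (entropy, gini):
        before = measure(parent)
        after = weighted_impurity(split, measure)
        print(measure.__name__, before, after, before - after)
```

Both measures are largest when the classes are evenly mixed and drop to 0 for a pure partition, so a tree builder prefers the split with the largest drop from parent to weighted children.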
A page
describing Bayes theorem and related matters. Sklearn's
GaussianNB is an implementation of Naive Bayes.
A Bayes computer for a 52-card deck is on acad at
~parson/DataMine/BayesCards.py
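If you are off acad, the following stand-alone sketch illustrates Bayes theorem on a 52-card deck by counting cards. It is not BayesCards.py; the deck encoding and the helpers prob and bayes are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))   # 52 (rank, suit) pairs


def prob(event, given=lambda card: True):
    """P(event | given), found by counting cards in the deck."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))


def bayes(likelihood, prior, evidence):
    """P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in ("J", "Q", "K")


if __name__ == "__main__":
    via_bayes = bayes(prob(is_face, is_king), prob(is_king), prob(is_face))
    via_count = prob(is_king, is_face)
    print(via_bayes, via_count)
```

Here P(face | king) = 1, P(king) = 4/52, and P(face) = 12/52, so Bayes theorem gives P(king | face) = 1/3, which matches counting the kings among the 12 face cards directly. Using Fraction keeps the arithmetic exact.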
Chapter
4 of Weka textbook includes Naive Bayes &
instance-based overviews.
Chapter
12 of Weka textbook introduces ensemble learning.
Assignment
3 is a Python data preparation & numerical
correlation problem from a csc558 project in Weka.
Due by 11:59 PM on Thursday November 12
via make turnitin.
Please see my email from Oct 27 AM
regarding reduction in regressors/test time.
Answer handout for spring
2020 CSC558 Assignment 2.
Chapter
4 slides include linear models & model trees
I adapted slides
for Linear Regression and M5P Trees, 10/30/2017.
Evaluating
numeric prediction and Minimum Description Length from
Chapter 5.
My
amended slides on Minimum Description Length and
Evaluating Numeric Prediction.
Here is an IBM
site overview of interpreting linear regression.
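As a rough illustration of the numeric-prediction material above, this sketch fits a one-attribute least-squares line and computes three common error measures: mean absolute error, root mean squared error, and the correlation coefficient. The function names and data are hypothetical, and the assignment itself uses library regressors rather than this hand-rolled version.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]                 # made-up attribute values
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up target values
    slope, intercept = fit_line(xs, ys)
    predicted = [slope * x + intercept for x in xs]
    print(slope, intercept)
    print(mean_absolute_error(ys, predicted))
    print(root_mean_squared_error(ys, predicted))
    print(correlation(ys, predicted))
```

RMSE penalizes large individual errors more heavily than MAE does, while the correlation coefficient measures how well predictions track the actual values regardless of scale, so the three can rank models differently.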
Assignment 4 is a new,
never-before-used Time Series project, due via make
turnitin by 11:59 PM Thursday December 3.
Some slides
on time series.
My solution
to Spring 2020 csc558 Assignment 3 on time series.
Assignment 5 is the final
exam project, for which I will answer questions for
clarification only in the 11/30 and 12/7 classes.
It is due by 11:59 PM on Saturday
December 12. It is based on the CSC558
Assignment 3 from spring 2020.
Here is my
ad hoc joinARFF example from 12/7.
ZOOM RECORDING ARCHIVES
August
24 Overview of the course, first day handout, demo &
discussion of COVID@KU graphical simulation.
August
31 class, went over two
regular expression assignments from previous semesters and
started on Assignment 1.
September 1 office hours (recorded with
attendees' permission), Q&A on finding your regular
expressions in assn1.
September 3 office hours I went over Assn1 Part
6 mini-lecture on using Python dictionaries to aggregate full TCP
streams.
September 14 class, went over prep for
Assignment 2 (see above), then had Q&A on Assignment 1 last
hour of class.
September 16 office hours, working through
processing the final data attribute for User Datagram Protocol
data.
September
21 class, went over above
Weka assignment as it relates to scikit-learn, last hour was
Q&A on Assignment 1 parts 6 & 8.
Here are
my class notes on Assignment 1 parts 6 & 8, updated to remove
bugs 9/22.
September
28 class, went over my Hawk Mtn. analysis in sklearn under
Assn2 above, also kappa, information entropy, & gini measures.
October
5 class, went over most of Assignment 2 handout. KU VPN (Virtual Private
Network) software is here.
Once on VPN, you can ssh or
putty directly into mcgonagall.kutztown.edu.
October
12 class, completed look at Assn2, also slides on
conditional probabilities, instance-based learning, & ensemble
learning.
October
19 class, inspecting output analysis of assn1, going over
predicting numeric attributes & their error measures, work
session.
October
20 debugging session, working through the outer "for(ATTR,...):" loop in
analyze() until testing succeeded.
October
26 class, went over Assignment 3. Please see my email
from Oct 27 AM regarding reduction in regressors/test time.
November
2 class, finished going over Assignment 3 handout code &
results, then work session until 9 PM.
November
9 class, started discussion of Time Series Analysis, 50
minutes working time for project 3 at the end.
November
16 class, handed out & went over the next project, Assignment 4.
November
30 class, handed out & went over Assignment 5, Q&A on
Assignment 4.
December
7 class, went over some pitfalls from previous projects,
some Q&A on Assignment 5, especially STUDENT A joinARFF(...).
Here is my ad hoc joinARFF example
from 12/7.