  1 studentid                           suid
  2 student-year                        syea      (Sophomore, Junior, Senior)
  3 student-track                       strk      (SD or IT or OT for other)
  4 course                              cour
  5 semester                            seme
  6 project-number                      prjn
  7 project-start-datetime              prjs
  8 project-end-datetime                prje
  9 assigned-until-started-hours        Hstr      (round to nearest hour)
 10 completed-until-due-hours           Hend      (round to nearest hour)
 11 started-until-due-hours             Jstr      (round to nearest hour)
 12 Jstr - hours lost to skipped days   Jfst      (Jstr - 24 * each skip)
 13 assigned-until-completed-hours      Jend      (round to nearest hour)
 14 started-until-completed-hours       Jall      (round to nearest hour)
 15 min-session-time-minutes            Mmin      (session gap of >= 60 mins)
 16 max-session-time-minutes            Mmax
 17 mean-session-time-minutes           Mavg
 18 stddev-session-time-minutes         Mdev
 19 median-session-time-minutes         Mmed
 20 mode-session-time-minutes           Mmod      (round to nearest 15)
 21 mean-time-between-sessions-hours    Havg
 22 stddev-time-between-sessions        Hdev
 23 min-session-files                   Fmin
 24 max-session-files                   Fmax
 25 mean-session-files                  Favg
 26 stddev-session-files                Fdev
 27 median-session-files                Fmed
 28 mode-session-files                  Fmod
 29 min-session-bytes                   Ymin
 30 max-session-bytes                   Ymax
 31 mean-session-bytes                  Yavg
 32 stddev-session-bytes                Ydev
 33 median-session-bytes                Ymed
 34 mode-session-bytes                  Ymod      (round to nearest 1000)
 35 min-session-lines                   Lmin      (may need to use ?)
 36 max-session-lines                   Lmax      (may need to use ?)
 37 mean-session-lines                  Lavg      (may need to use ?)
 38 stddev-session-lines                Ldev      (may need to use ?)
 39 median-session-lines                Lmed      (may need to use ?)
 40 mode-session-lines                  Lmod      (round 20, need to use ?)
 41 min-session-added                   Amin      (may need to use ?)
 42 max-session-added                   Amax      (may need to use ?)
 43 mean-session-added                  Aavg      (may need to use ?)
 44 stddev-session-added                Adev      (may need to use ?)
 45 median-session-added                Amed      (may need to use ?)
 46 mode-session-added                  Amod      (round 20, need to use ?)
 47 min-session-deleted                 Dmin      (may need to use ?)
 48 max-session-deleted                 Dmax      (may need to use ?)
 49 mean-session-deleted                Davg      (may need to use ?)
 50 stddev-session-deleted              Ddev      (may need to use ?)
 51 median-session-deleted              Dmed      (may need to use ?)
 52 mode-session-deleted                Dmod      (round 20, need to use ?)
 53 min-session-changed                 Cmin      (may need to use ?)
 54 max-session-changed                 Cmax      (may need to use ?)
 55 mean-session-changed                Cavg      (may need to use ?)
 56 stddev-session-changed              Cdev      (may need to use ?)
 57 median-session-changed              Cmed      (may need to use ?)
 58 mode-session-changed                Cmod      (round 120, to use ?)
 59 number-sessions                     Snum
 60 total-session-time-minutes          Mtot
 61 number-sessions-centered-hour0-3    S0003
 62 number-sessions-centered-hour4-7    S0407
 63 number-sessions-centered-hour8-11   S0811
 64 number-sessions-centered-hour12-15  S1215
 65 number-sessions-centered-hour16-19  S1619
 66 number-sessions-centered-hour20-23  S2023
 67 mean-compete-csc-projects-assign    Xasn
 68 mean-compete-csc-projects-due       Xdue
 69 mean-compete-exams                  Xams
 70 number-builds-started               Bsta
 71 number-builds-completed             Bend
 72 number-tests-unix-started           Tstx
 73 number-tests-unix-completed         Tenx
 74 number-tests-pc-started             Tstp      (Tests on student's machine.)
 75 number-tests-pc-completed           Tenp      (Tests on student's machine.)
 76 total-tests-started                 Tstb      (Both Unix & PC test starts.)
 77 total-tests-completed               Tenb      (Both Unix & PC test end.)
 78 post-turnitin-make-actions          Ptis
 79 clued-emails                        Eyes
 80 clueless-emails                     Enot
 81 total-emails                        Etot
 82 grade point average at start        Cumg
 83 number credits at start semester    Crdg
 84 grade point average in csc >= 125   Cumm
 85 number credits in csc >= 125        Crdm
 86 course-numeric-grade                Gcrs
 87 course-letter-grade                 Glet
 88 project-numeric-grade               Gprj
 89 project-letter-bin                  Gplt      (3 bands per grade)
 90 course-percentile-grade             GcrsRank  (addrank.py for a prjn)
 91 project-percentile-grade            GprjRank  (addrank.py for a prjn)

GprjMinMax.py extracts attributes 92, 93, 94, 95 (for semester of projects)
 92 max-grade-for-student-course        GprjMax   (added 9/17/2014)
 93 min-grade-for-student-course        GprjMin   (added 9/17/2014)
 94 spread-grade-for-student-course     GprjSpr   (added 9/17/2014)
 95 my-grade-for-student-course         GprjMy    (added 9/17/2014)

addmyhistory.py extracts attributes 96, 97 (run on a full semester)
 96 my-average-grade-previous-projs     Gprv      (added 2/10/2015)
 97 my-stddev-grade-previous-projs      Gsdv      (add 4/2/2015)

 98 sessions centered in a Sunday       SSun      (add 5/12/2015)
 99 sessions centered in a Monday       SMon      (add 5/12/2015)
100 sessions centered in a Tuesday      STue      (add 5/12/2015)
101 sessions centered in a Wednesday    SWed      (add 5/12/2015)
102 sessions centered in a Thursday     SThu      (add 5/12/2015)
103 sessions centered in a Friday       SFri      (add 5/12/2015)
104 sessions in 0 days from Wed         SWed0     (add 5/12/2015)
105 sessions in 1 days from Wed         SWed1     (add 5/12/2015)
106 sessions in 2 days from Wed         SWed2     (add 5/12/2015)
107 sessions in 3 days from Wed         SWed3     (add 5/12/2015)

The following pseudo-attributes were added 9/27/2014 for incremental
updates to a visualizer. These do not go into an ARFF file.
  A prj duration in hours               prjd      (prje - prjs)
  B prj time since beginning in hours   prjt      (datetime - prjs)

NOTES:

We will start this work in class the evening of Thursday 9/19. For now
please read the doc and formulate any questions. We will analyze ARFF
file csc243sp2013prj2to5.arff.

1. Any attribute containing ? as a value in this dataset can and probably
   should be discarded on initial analysis. To find them, look for the
   grey cells in Weka's EDIT window. They include the mode attributes,
   because there is not always an unambiguous mode, the line data (lines
   changed/added/deleted), the surveys (because of survey data collection
   errors), and probably others.

2. Of the string data, studentid should be removed, and the others should
   be nominalized using filter StringToNominal.

3. Attributes Gcrs, Glet and GcrsRank are redundant with each other,
   giving different views of the same data. You can keep at most one at a
   time, or the algorithms will infer one from the others. Gprj, Gplt and
   GprjRank are redundant in the same way for the project. GcrsRank and
   GprjRank are numeric centile ranks for the course and project
   respectively. They may be very useful since they expand clumped grade
   concentrations, and can be Discretized into (10?) bins for J48,
   NaiveBayes and other classifiers requiring nominal targets.

4. Looking back through the spring csc243 dataset with Weka in September,
   I am surprised to see OneR outperforming J48 in various basic
   investigations. Apparently, J48 is being confused by ambiguous data.
   I don't remember that from my quick look this summer.

5. One approach is to use OneR to find the most predictive attribute,
   remove that attribute, then see what the second-most predictive
   attribute is, then remove that, and so on. This approach will give you
   a set of perhaps up to 10 of the most predictive attributes. Then you
   can throw out all the others, keep those 10, and use more powerful
   algorithms such as J48, NaiveBayes or M5P / M5Rules on those attributes
   to see how they fare. The number 10 is just a guess. Too few means
   throwing away too much data; too many become hard to interpret.

6. My final suggestion for now is to see what you can use to predict
   Gplt, Gprj, GprjRank, and a Discretized GprjRank, one at a time.
   Gplt and a Discretized GprjRank are nominal and therefore amenable to
   OneR, J48, NaiveBayes and RandomTree. Gprj and GprjRank are numeric
   and therefore amenable to M5P, M5Rules, and SimpleKMeans clustering
   (among others). Creating enough clusters to show at least 4 different
   grade levels in the target attribute actually looks like it might be
   useful.

7. Added Jfst on May 16, 2014. It is Jstr - 24 hours * the number of days
   on which work was skipped between the start and the final turnitin.

8. To get GprjMax and GprjMin from Gprj, run script GprjMinMax.py, added
   9/17/2014. These are the max and min of Gprj for the student within a
   single course. The script can also filter out students with grade
   spreads below a certain percentage or with min grades >= a certain
   point. Also added GprjSpr = (GprjMax-GprjMin) and
   GprjMy = (Gprj/GprjMax).

9. Added attributes A and B for the visualizer, 9/27/2014.

10. Added Gprv, which is this student's mean grade for all projects in
    this semester (course) before the current project. It is '?' for the
    first project of the semester.

11. Added Gsdv April 3, 2015. It is the standard deviation of the
    previous project grades that Gprv averages.

12. Attributes 98 through 107 for day-of-week (Sunday is 0) and
    days-from-Wednesday (either side, Wed is 0, weekend days are 3) added
    May 12, 2015.
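
The grade attributes described in notes 8, 10 and 11 can be sketched in
Python as follows. This is a minimal illustration, not the code of
GprjMinMax.py or addmyhistory.py. The function name grade_attributes is
hypothetical, the use of a population standard deviation for Gsdv is an
assumption, and so is writing '?' for Gsdv on the first project.

```python
from statistics import mean, pstdev


def grade_attributes(grades):
    """Derive per-project grade attributes for one student in one course.

    grades: the student's Gprj values in project order (hypothetical input).
    Returns one dict per project, using '?' where a value is undefined.
    """
    gmax, gmin = max(grades), min(grades)
    rows = []
    for i, gprj in enumerate(grades):
        previous = grades[:i]  # projects before the current one
        rows.append({
            "Gprj": gprj,
            "GprjMax": gmax,
            "GprjMin": gmin,
            "GprjSpr": gmax - gmin,                      # note 8
            "GprjMy": gprj / gmax if gmax else "?",      # note 8
            "Gprv": mean(previous) if previous else "?",  # note 10
            # Assumption: population stddev, '?' when there is no history.
            "Gsdv": pstdev(previous) if previous else "?",
        })
    return rows


rows = grade_attributes([80, 90, 100])
for row in rows:
    print(row)
```

With grades 80, 90 and 100, the first project has no history, so its
Gprv is '?', while the third project averages the first two.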
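
As a rough illustration of notes 7 and 12, the sketch below computes Jfst
and the day-of-week counts in Python. It is not the original extraction
code. It assumes that a session's "center" is the midpoint between its
start and end times, and that Saturday sessions count only toward SWed3,
since attributes 98 through 103 list Sunday through Friday. The function
names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Attribute names for days 0..5 (Sunday is 0); the list has no Saturday entry.
DAY_NAMES = ["SSun", "SMon", "STue", "SWed", "SThu", "SFri"]


def jfst(jstr_hours, skipped_days):
    """Note 7: Jstr minus 24 hours for each skipped day."""
    return jstr_hours - 24 * skipped_days


def day_of_week_counts(sessions):
    """Note 12: count sessions by day of week and by days from Wednesday.

    sessions: iterable of (start, end) datetime pairs (hypothetical input).
    """
    counts = Counter()
    for start, end in sessions:
        center = start + (end - start) / 2   # assumed meaning of "centered"
        dow = (center.weekday() + 1) % 7     # convert Monday=0 to Sunday=0
        if dow < len(DAY_NAMES):
            counts[DAY_NAMES[dow]] += 1
        counts["SWed%d" % abs(dow - 3)] += 1  # Wed is 0, Sat and Sun are 3
    return counts


sessions = [
    (datetime(2015, 5, 13, 10), datetime(2015, 5, 13, 12)),  # Wednesday
    (datetime(2015, 5, 12, 23), datetime(2015, 5, 13, 3)),   # centered Wed
    (datetime(2015, 5, 16, 20), datetime(2015, 5, 16, 22)),  # Saturday
    (datetime(2015, 5, 17, 9), datetime(2015, 5, 17, 11)),   # Sunday
]
counts = day_of_week_counts(sessions)
print(dict(counts))
print(jfst(100, 2))
```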
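
The remove-and-repeat procedure in note 5 can be sketched outside Weka as
follows. This is a plain-Python stand-in for Weka's OneR, not Weka's
implementation: it handles nominal attributes only and scores each rule by
training accuracy rather than cross-validation. The function names and the
toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def one_r_accuracy(rows, attr, target):
    """Training accuracy of a one-attribute rule.

    Each value of attr predicts the target class seen most often with it.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attr]][row[target]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)


def rank_by_repeated_one_r(rows, attrs, target, keep=10):
    """Pick the best single attribute, remove it, and repeat up to keep times."""
    remaining = list(attrs)
    ranked = []
    while remaining and len(ranked) < keep:
        best = max(remaining, key=lambda a: one_r_accuracy(rows, a, target))
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy data: "a" predicts the target perfectly, "b" partly, "c" not at all.
rows = [
    {"a": "x", "b": "p", "c": "k", "t": "hi"},
    {"a": "x", "b": "p", "c": "k", "t": "hi"},
    {"a": "y", "b": "p", "c": "k", "t": "lo"},
    {"a": "y", "b": "q", "c": "k", "t": "lo"},
]
print(rank_by_repeated_one_r(rows, ["a", "b", "c"], "t", keep=2))
```

In this sketch each attribute is scored on its own, so removing the winner
and re-running gives the same order as sorting all attributes once by
their one-attribute accuracy. The loop is kept to mirror the steps in the
note.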