CSC 558, Association Rules & Clustering, Spring 2023

CSC 558 - Data Mining and Predictive Analytics II, Spring 2023, Wed 6-8:50 PM in Old Main 158.

Association Rules in Weka (sources: textbook section 3.4 & Appendix 2.6)

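An association rule states an association among attribute values. There is no class (target) attribute: any attribute may appear on either side of a rule, so the same pair of items can yield a rule in each direction (see rules 4 and 5 in EXAMPLE 1).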

A rule has a left-hand side (LHS a.k.a. antecedent, premise) and a right-hand side (RHS a.k.a. consequent).

A rule's coverage (a.k.a. support) is the number of instances it predicts correctly.

A rule's accuracy (a.k.a. confidence) is the ratio of instances: (LHS and RHS are true) / (LHS is true).
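
As a minimal sketch of the two definitions above, the following Python computes coverage, support as a fraction of the data, and confidence from raw counts. The function name and arguments are illustrative and not part of Weka; the counts are those printed for rule 4 in EXAMPLE 1 (toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001, out of 10005 instances).

```python
def support_and_confidence(count_lhs, count_both, total_instances):
    """Return (coverage, support fraction, confidence) for one rule.

    count_lhs  -- instances where the LHS is true
    count_both -- instances where LHS and RHS are both true
    """
    coverage = count_both                     # instances predicted correctly
    support = count_both / total_instances    # the same figure as a fraction
    confidence = count_both / count_lhs       # (LHS and RHS) / (LHS)
    return coverage, support, confidence


# Rule 4 of EXAMPLE 1: toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001
coverage, support, confidence = support_and_confidence(2001, 2001, 10005)
print(coverage, round(support, 3), confidence)
```

The support fraction of 0.2 is the same quantity Weka reports as "Minimum support: 0.2 (2001 instances)".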

Lift is the confidence divided by the proportion of all instances that the consequent covers, not by the rule's own support. (Parson: The divisor appears to be countRHS / countTotalInstances.)
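
A short sketch of the lift calculation, assuming (as the annotated rules in EXAMPLE 1 suggest) that the divisor is the fraction of all instances matching the RHS. The helper name is hypothetical. The counts come from rules 2 and 4 of EXAMPLE 1; the RHS item counts (4002 and 2001) are read off rules 1 and 5, where the same items appear on the LHS.

```python
def lift(count_lhs, count_both, count_rhs, total_instances):
    """Confidence divided by the fraction of instances matching the RHS."""
    confidence = count_both / count_lhs
    rhs_fraction = count_rhs / total_instances
    return confidence / rhs_fraction


TOTAL = 10005
# Rule 2: toosc=SinOsc 2001 ==> ampl3='(-inf-0.094748]' 2001 (RHS item count 4002)
rule2_lift = lift(2001, 2001, 4002, TOTAL)
# Rule 4: toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001 (RHS item count 2001)
rule4_lift = lift(2001, 2001, 2001, TOTAL)
print(round(rule2_lift, 2), round(rule4_lift, 2))
```

Rule 2 is the informative case: its LHS count (2001) and RHS count (4002) differ, and only the RHS count reproduces the printed lift of 2.5.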

Leverage is the proportion of additional examples covered by both the premise and the consequent beyond those expected if the premise and consequent were statistically independent.
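
The leverage definition above can be sketched as the observed joint proportion minus the proportion expected under independence. The function name is illustrative. The counts are again those of rule 4 in EXAMPLE 1, and reading Weka's bracketed number as the leverage scaled to an instance count and truncated is an assumption based on the printed output.

```python
def leverage(count_lhs, count_rhs, count_both, total_instances):
    """P(LHS and RHS) minus P(LHS) * P(RHS)."""
    p_both = count_both / total_instances
    p_independent = (count_lhs / total_instances) * (count_rhs / total_instances)
    return p_both - p_independent


TOTAL = 10005
# Rule 4: toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001
lev = leverage(count_lhs=2001, count_rhs=2001, count_both=2001, total_instances=TOTAL)
extra_instances = lev * TOTAL   # additional instances beyond independence
print(round(lev, 2), int(extra_instances))
```

For this rule Weka prints lev:(0.16) [1600], which agrees with the two values computed here.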

Conviction is a measure defined by Brin et al. (1997):

"Unlike confidence, conviction is normalized based on both the antecedent and the consequent of the rule like the statistical notion of correlation. Furthermore, unlike interest, it is directional and measures actual implication as opposed to co-occurrence." (page 2 of 10)

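Brin et al. define conviction as P(LHS) * P(not RHS) / P(LHS and not RHS), which is unbounded for a rule with no counterexamples. Every rule in EXAMPLE 1 has confidence 1, yet Weka prints finite conviction values; those values are consistent with adding 1 to the counterexample count. That smoothing is an assumption inferred from the printed numbers, not something stated in these notes, and the function name is hypothetical.

```python
def conviction_smoothed(count_lhs, count_rhs, count_both, total_instances):
    """Expected counterexamples under independence / (observed + 1).

    The "+ 1" is an ASSUMPTION made to match the finite values in the
    Weka output; the textbook definition has no such term.
    """
    expected_counterexamples = (
        count_lhs * (total_instances - count_rhs) / total_instances
    )
    observed_counterexamples = count_lhs - count_both
    return expected_counterexamples / (observed_counterexamples + 1)


# Rule 4: toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001
conv = conviction_smoothed(2001, 2001, 2001, 10005)
print(round(conv, 1))
```

For rule 4 this gives the conv:(1600.8) that Weka reports.
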
EXAMPLE 1:

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     extractAudioFreqARFF_WavPathsModule-weka.filters.unsupervised.attribute.Remove-R66-69-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6-weka.filters.unsupervised.attribute.Remove-R1-2-weka.filters.unsupervised.attribute.Remove-R1-2,4,6,8,10,12,14-62-weka.filters.unsupervised.attribute.Remove-R2-5
Instances:    10005
Attributes:   3
              ampl3
              ampl8
              toosc
=== Associator model (full training set) ===
Apriori
=======

Minimum support: 0.2 (2001 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16

Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 7
Size of set of large itemsets L(3): 2

Best rules found:

1. ampl3='(-inf-0.094748]' 4002 ==> ampl8='(-inf-0.062418]' 4002    <conf:(1)> lift:(2.5) lev:(0.24) [2400] conv:(2400.4)
        lift = conf:(1) / (4002 / 10005) = 2.5
2. toosc=SinOsc 2001 ==> ampl3='(-inf-0.094748]' 2001    <conf:(1)> lift:(2.5) lev:(0.12) [1200] conv:(1200.6)
3. toosc=TriOsc 2001 ==> ampl3='(-inf-0.094748]' 2001    <conf:(1)> lift:(2.5) lev:(0.12) [1200] conv:(1200.6)
4. toosc=SqrOsc 2001 ==> ampl3='(0.189492-0.284237]' 2001    <conf:(1)> lift:(5) lev:(0.16) [1600] conv:(1600.8)
      lift = conf:(1) / (2001 / 10005) = 5.0
5. ampl3='(0.189492-0.284237]' 2001 ==> toosc=SqrOsc 2001    <conf:(1)> lift:(5) lev:(0.16) [1600] conv:(1600.8)
6. toosc=SawOsc 2001 ==> ampl3='(0.284237-0.378982]' 2001    <conf:(1)> lift:(5) lev:(0.16) [1600] conv:(1600.8)
7. ampl3='(0.284237-0.378982]' 2001 ==> toosc=SawOsc 2001    <conf:(1)> lift:(5) lev:(0.16) [1600] conv:(1600.8)
8. toosc=SinOsc 2001 ==> ampl8='(-inf-0.062418]' 2001    <conf:(1)> lift:(2.5) lev:(0.12) [1200] conv:(1200.2)
9. toosc=TriOsc 2001 ==> ampl8='(-inf-0.062418]' 2001    <conf:(1)> lift:(2.5) lev:(0.12) [1200] conv:(1200.2)
10. ampl8='(-inf-0.062418]' toosc=SinOsc 2001 ==> ampl3='(-inf-0.094748]' 2001    <conf:(1)> lift:(2.5) lev:(0.12) [1200] conv:(1200.6)
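
The header of this run ties together three of the Apriori options in the Scheme line: -U 1.0 (upper bound on minimum support), -D 0.05 (amount by which minimum support is lowered per iteration) and -M 0.1 (lower bound). As a rough arithmetic check, and assuming each reported cycle lowers the support by one delta, 16 cycles starting from 1.0 land on the reported minimum support of 0.2. The variable names are illustrative.

```python
upper_bound = 1.0    # -U: starting minimum support
delta = 0.05         # -D: decrease per cycle
lower_bound = 0.1    # -M: support is never lowered below this
cycles = 16          # "Number of cycles performed: 16"
total_instances = 10005

# Assumption: one delta is subtracted per reported cycle.
min_support = round(upper_bound - cycles * delta, 2)
min_instances = round(min_support * total_instances)
print(min_support, min_instances, min_support >= lower_bound)
```

The result matches "Minimum support: 0.2 (2001 instances)": the search stopped once 10 rules (-N 10) with confidence of at least 0.9 (-C 0.9) were found, before reaching the 0.1 floor.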

CLUSTERING

EXAMPLE 1:

Scheme:       weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation:     extractAudioFreqARFF_WavPathsModule-weka.filters.unsupervised.attribute.Remove-R66-69-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6-weka.filters.unsupervised.attribute.Remove-R1-2-weka.filters.unsupervised.attribute.Remove-R1-2,4,6,8,10,12,14-62
Instances:    10005
Attributes:   7
              ampl3
              ampl4
              ampl5
              ampl6
              ampl7
              ampl8
              toosc
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

EM
==

Number of clusters selected by cross validation: 4
Number of iterations performed: 0

                       Cluster
Attribute                    0    1    2    3
                         (0.4)(0.2)(0.2)(0.2)
==============================================
ampl3
  '(-inf-0.094748]'        4003    1    1    1
  '(0.094748-0.189492]'       1    1    1    1
  '(0.189492-0.284237]'       1 2002    1    1
  '(0.284237-0.378982]'       1    1    1 2002
  '(0.378982-0.473726]'       1    1    1    1
  '(0.473726-0.568471]'       1    1    1    1
  '(0.568471-0.663216]'       1    1    1    1
  '(0.663216-0.757961]'       1    1    1    1
  '(0.757961-0.852705]'       1    1    3    1
  '(0.852705-inf)'            1    1 2000    1
  [total]                  4012 2011 2011 2011
ampl4
  '(-inf-0.090169]'        4003    1    1    1
  '(0.090169-0.180334]'       1 2002    1    1
  '(0.180334-0.2705]'         1    1    1 2002
  '(0.2705-0.360665]'         1    1    1    1
  '(0.360665-0.450831]'       1    1    1    1
  '(0.450831-0.540997]'       1    1    1    1
  '(0.540997-0.631162]'       1    1    1    1
  '(0.631162-0.721328]'       1    1    3    1
  '(0.721328-0.811493]'       1    1 1998    1
  '(0.811493-inf)'            1    1    3    1
  [total]                  4012 2011 2011 2011
ampl5
  '(-inf-0.084579]'        4003    1    1    1
  '(0.084579-0.169155]'       1 2002    1    1
  '(0.169155-0.253732]'       1    1    1 2002
  '(0.253732-0.338308]'       1    1    1    1
  '(0.338308-0.422884]'       1    1    1    1
  '(0.422884-0.50746]'        1    1    1    1
  '(0.50746-0.592036]'        1    1    3    1
  '(0.592036-0.676613]'       1    1 1993    1
  '(0.676613-0.761189]'       1    1    7    1
  '(0.761189-inf)'            1    1    2    1
  [total]                  4012 2011 2011 2011
ampl6
  '(-inf-0.079019]'        4003    1    1    1
  '(0.079019-0.158035]'       1 2002    1    2
  '(0.158035-0.237051]'       1    1    1 2001
  '(0.237051-0.316067]'       1    1    1    1
  '(0.316067-0.395083]'       1    1    3    1
  '(0.395083-0.474098]'       1    1    2    1
  '(0.474098-0.553114]'       1    1 1993    1
  '(0.553114-0.63213]'        1    1    6    1
  '(0.63213-0.711146]'        1    1    1    1
  '(0.711146-inf)'            1    1    2    1
  [total]                  4012 2011 2011 2011
ampl7
  '(-inf-0.069957]'        4003    1    1    1
  '(0.069957-0.139911]'       1 2002    1  103
  '(0.139911-0.209866]'       1    1    1 1900
  '(0.209866-0.27982]'        1    1    3    1
  '(0.27982-0.349775]'        1    1    7    1
  '(0.349775-0.419729]'       1    1 1987    1
  '(0.419729-0.489683]'       1    1    6    1
  '(0.489683-0.559638]'       1    1    2    1
  '(0.559638-0.629592]'       1    1    1    1
  '(0.629592-inf)'            1    1    2    1
  [total]                  4012 2011 2011 2011
ampl8
  '(-inf-0.062418]'        4003    3    1    1
  '(0.062418-0.124833]'       1 2000    1  887
  '(0.124833-0.187249]'       1    1    1 1116
  '(0.187249-0.249665]'       1    1 1954    1
  '(0.249665-0.312081]'       1    1   44    1
  '(0.312081-0.374496]'       1    1    4    1
  '(0.374496-0.436912]'       1    1    2    1
  '(0.436912-0.499328]'       1    1    1    1
  '(0.499328-0.561743]'       1    1    1    1
  '(0.561743-inf)'            1    1    2    1
  [total]                  4012 2011 2011 2011
toosc
  PulseOsc                    1    1 2002    1
  SawOsc                      1    1    1 2002
  SinOsc                   2002    1    1    1
  SqrOsc                      1 2002    1    1
  TriOsc                   2002    1    1    1
  [total]                  4007 2006 2006 2006

Time taken to build model (full training data) : 3.15 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       4002 ( 40%)
1       2001 ( 20%)
2       2001 ( 20%)
3       2001 ( 20%)

EXAMPLE 2:

=== Run information ===

Scheme:       weka.clusterers.EM -I 100 -N 5 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation:     extractAudioFreqARFF_WavPathsModule-weka.filters.unsupervised.attribute.Remove-R66-69-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6-weka.filters.unsupervised.attribute.Remove-R1-2-weka.filters.unsupervised.attribute.Remove-R1-2,4,6,8,10,12,14-62
Instances:    10005
Attributes:   7
              ampl3
              ampl4
              ampl5
              ampl6
              ampl7
              ampl8
              toosc
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

EM
==

Number of clusters: 5
Number of iterations performed: 0

                       Cluster
Attribute                    0    1    2    3    4
                         (0.2)(0.2)(0.2)(0.2)(0.2)
===================================================
ampl3
  '(-inf-0.094748]'           1 2002    1 2002    1
  '(0.094748-0.189492]'       1    1    1    1    1
  '(0.189492-0.284237]'    2002    1    1    1    1
  '(0.284237-0.378982]'       1    1    1    1 2002
  '(0.378982-0.473726]'       1    1    1    1    1
  '(0.473726-0.568471]'       1    1    1    1    1
  '(0.568471-0.663216]'       1    1    1    1    1
  '(0.663216-0.757961]'       1    1    1    1    1
  '(0.757961-0.852705]'       1    1    3    1    1
  '(0.852705-inf)'            1    1 2000    1    1
  [total]                  2011 2011 2011 2011 2011
ampl4
  '(-inf-0.090169]'           1 2002    1 2002    1
  '(0.090169-0.180334]'    2002    1    1    1    1
  '(0.180334-0.2705]'         1    1    1    1 2002
  '(0.2705-0.360665]'         1    1    1    1    1
  '(0.360665-0.450831]'       1    1    1    1    1
  '(0.450831-0.540997]'       1    1    1    1    1
  '(0.540997-0.631162]'       1    1    1    1    1
  '(0.631162-0.721328]'       1    1    3    1    1
  '(0.721328-0.811493]'       1    1 1998    1    1
  '(0.811493-inf)'            1    1    3    1    1
  [total]                  2011 2011 2011 2011 2011
ampl5
  '(-inf-0.084579]'           1 2002    1 2002    1
  '(0.084579-0.169155]'    2002    1    1    1    1
  '(0.169155-0.253732]'       1    1    1    1 2002
  '(0.253732-0.338308]'       1    1    1    1    1
  '(0.338308-0.422884]'       1    1    1    1    1
  '(0.422884-0.50746]'        1    1    1    1    1
  '(0.50746-0.592036]'        1    1    3    1    1
  '(0.592036-0.676613]'       1    1 1993    1    1
  '(0.676613-0.761189]'       1    1    7    1    1
  '(0.761189-inf)'            1    1    2    1    1
  [total]                  2011 2011 2011 2011 2011
ampl6
  '(-inf-0.079019]'           1 2002    1 2002    1
  '(0.079019-0.158035]'    2002    1    1    1    2
  '(0.158035-0.237051]'       1    1    1    1 2001
  '(0.237051-0.316067]'       1    1    1    1    1
  '(0.316067-0.395083]'       1    1    3    1    1
  '(0.395083-0.474098]'       1    1    2    1    1
  '(0.474098-0.553114]'       1    1 1993    1    1
  '(0.553114-0.63213]'        1    1    6    1    1
  '(0.63213-0.711146]'        1    1    1    1    1
  '(0.711146-inf)'            1    1    2    1    1
  [total]                  2011 2011 2011 2011 2011
ampl7
  '(-inf-0.069957]'           1 2002    1 2002    1
  '(0.069957-0.139911]'    2002    1    1    1  103
  '(0.139911-0.209866]'       1    1    1    1 1900
  '(0.209866-0.27982]'        1    1    3    1    1
  '(0.27982-0.349775]'        1    1    7    1    1
  '(0.349775-0.419729]'       1    1 1987    1    1
  '(0.419729-0.489683]'       1    1    6    1    1
  '(0.489683-0.559638]'       1    1    2    1    1
  '(0.559638-0.629592]'       1    1    1    1    1
  '(0.629592-inf)'            1    1    2    1    1
  [total]                  2011 2011 2011 2011 2011
ampl8
  '(-inf-0.062418]'           3 2002    1 2002    1
  '(0.062418-0.124833]'    2000    1    1    1  887
  '(0.124833-0.187249]'       1    1    1    1 1116
  '(0.187249-0.249665]'       1    1 1954    1    1
  '(0.249665-0.312081]'       1    1   44    1    1
  '(0.312081-0.374496]'       1    1    4    1    1
  '(0.374496-0.436912]'       1    1    2    1    1
  '(0.436912-0.499328]'       1    1    1    1    1
  '(0.499328-0.561743]'       1    1    1    1    1
  '(0.561743-inf)'            1    1    2    1    1
  [total]                  2011 2011 2011 2011 2011
toosc
  PulseOsc                    1    1 2002    1    1
  SawOsc                      1    1    1    1 2002
  SinOsc                      1 2002    1    1    1
  SqrOsc                   2002    1    1    1    1
  TriOsc                      1    1    1 2002    1
  [total]                  2006 2006 2006 2006 2006

Time taken to build model (full training data) : 0.08 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       2001 ( 20%)
1       2001 ( 20%)
2       2001 ( 20%)
3       2001 ( 20%)
4       2001 ( 20%)

EXAMPLE 3 (K-means Random start seed 10, 5 clusters):

Scheme:       weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:     extractAudioFreqARFF_WavPathsModule-weka.filters.unsupervised.attribute.Remove-R66-69-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6-weka.filters.unsupervised.attribute.Remove-R1-2-weka.filters.unsupervised.attribute.Remove-R1-2,4,6,8,10,12,14-62
Instances:    10005
Attributes:   7
              ampl3
              ampl4
              ampl5
              ampl6
              ampl7
              ampl8
              toosc
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 2
Within cluster sum of squared errors: 1078.0

Initial starting points (random):

Cluster 0: '\'(0.284237-0.378982]\'','\'(0.180334-0.2705]\'','\'(0.169155-0.253732]\'','\'(0.158035-0.237051]\'','\'(0.139911-0.209866]\'','\'(0.124833-0.187249]\'',SawOsc
Cluster 1: '\'(-inf-0.094748]\'','\'(-inf-0.090169]\'','\'(-inf-0.084579]\'','\'(-inf-0.079019]\'','\'(-inf-0.069957]\'','\'(-inf-0.062418]\'',TriOsc
Cluster 2: '\'(0.852705-inf)\'','\'(0.721328-0.811493]\'','\'(0.592036-0.676613]\'','\'(0.474098-0.553114]\'','\'(0.349775-0.419729]\'','\'(0.187249-0.249665]\'',PulseOsc
Cluster 3: '\'(-inf-0.094748]\'','\'(-inf-0.090169]\'','\'(-inf-0.084579]\'','\'(-inf-0.079019]\'','\'(-inf-0.069957]\'','\'(-inf-0.062418]\'',SinOsc
Cluster 4: '\'(0.189492-0.284237]\'','\'(0.090169-0.180334]\'','\'(0.084579-0.169155]\'','\'(0.079019-0.158035]\'','\'(0.069957-0.139911]\'','\'(0.062418-0.124833]\'',SqrOsc

Missing values globally replaced with mean/mode

Final cluster centroids:
                                                            Cluster#
Attribute                            Full Data                     0                     1                     2                     3                     4
                                     (10005.0)              (2001.0)              (2001.0)              (2001.0)              (2001.0)              (2001.0)
============================================================================================================================================================
ampl3                        '(-inf-0.094748]' '(0.284237-0.378982]'     '(-inf-0.094748]'      '(0.852705-inf)'     '(-inf-0.094748]' '(0.189492-0.284237]'
ampl4                        '(-inf-0.090169]'   '(0.180334-0.2705]'     '(-inf-0.090169]' '(0.721328-0.811493]'     '(-inf-0.090169]' '(0.090169-0.180334]'
ampl5                        '(-inf-0.084579]' '(0.169155-0.253732]'     '(-inf-0.084579]' '(0.592036-0.676613]'     '(-inf-0.084579]' '(0.084579-0.169155]'
ampl6                        '(-inf-0.079019]' '(0.158035-0.237051]'     '(-inf-0.079019]' '(0.474098-0.553114]'     '(-inf-0.079019]' '(0.079019-0.158035]'
ampl7                        '(-inf-0.069957]' '(0.139911-0.209866]'     '(-inf-0.069957]' '(0.349775-0.419729]'     '(-inf-0.069957]' '(0.069957-0.139911]'
ampl8                        '(-inf-0.062418]' '(0.124833-0.187249]'     '(-inf-0.062418]' '(0.187249-0.249665]'     '(-inf-0.062418]' '(0.062418-0.124833]'
toosc                                 PulseOsc                SawOsc                TriOsc              PulseOsc                SinOsc                SqrOsc

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       2001 ( 20%)
1       2001 ( 20%)
2       2001 ( 20%)
3       2001 ( 20%)
4       2001 ( 20%)

EXAMPLE 4 (K-means k-means++ start, seed 10, 5 clusters):

Scheme:       weka.clusterers.SimpleKMeans -init 1 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:     extractAudioFreqARFF_WavPathsModule-weka.filters.unsupervised.attribute.Remove-R66-69-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6-weka.filters.unsupervised.attribute.Remove-R1-2-weka.filters.unsupervised.attribute.Remove-R1-2,4,6,8,10,12,14-62
Instances:    10005
Attributes:   7
              ampl3
              ampl4
              ampl5
              ampl6
              ampl7
              ampl8
              toosc
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 2
Within cluster sum of squared errors: 2985.0

Initial starting points (k-means++):

Cluster 0: '\'(0.284237-0.378982]\'','\'(0.180334-0.2705]\'','\'(0.169155-0.253732]\'','\'(0.158035-0.237051]\'','\'(0.139911-0.209866]\'','\'(0.124833-0.187249]\'',SawOsc
Cluster 1: '\'(-inf-0.094748]\'','\'(-inf-0.090169]\'','\'(-inf-0.084579]\'','\'(-inf-0.079019]\'','\'(-inf-0.069957]\'','\'(-inf-0.062418]\'',SinOsc
Cluster 2: '\'(0.284237-0.378982]\'','\'(0.180334-0.2705]\'','\'(0.169155-0.253732]\'','\'(0.158035-0.237051]\'','\'(0.069957-0.139911]\'','\'(0.062418-0.124833]\'',SawOsc
Cluster 3: '\'(0.189492-0.284237]\'','\'(0.090169-0.180334]\'','\'(0.084579-0.169155]\'','\'(0.079019-0.158035]\'','\'(0.069957-0.139911]\'','\'(0.062418-0.124833]\'',SqrOsc
Cluster 4: '\'(0.852705-inf)\'','\'(0.721328-0.811493]\'','\'(0.592036-0.676613]\'','\'(0.474098-0.553114]\'','\'(0.349775-0.419729]\'','\'(0.187249-0.249665]\'',PulseOsc

Missing values globally replaced with mean/mode

Final cluster centroids:
                                                            Cluster#
Attribute                            Full Data                     0                     1                     2                     3                     4
                                     (10005.0)              (1954.0)              (4002.0)                (47.0)              (2001.0)              (2001.0)
============================================================================================================================================================
ampl3                        '(-inf-0.094748]' '(0.284237-0.378982]'     '(-inf-0.094748]' '(0.284237-0.378982]' '(0.189492-0.284237]'      '(0.852705-inf)'
ampl4                        '(-inf-0.090169]'   '(0.180334-0.2705]'     '(-inf-0.090169]'   '(0.180334-0.2705]' '(0.090169-0.180334]' '(0.721328-0.811493]'
ampl5                        '(-inf-0.084579]' '(0.169155-0.253732]'     '(-inf-0.084579]' '(0.169155-0.253732]' '(0.084579-0.169155]' '(0.592036-0.676613]'
ampl6                        '(-inf-0.079019]' '(0.158035-0.237051]'     '(-inf-0.079019]' '(0.158035-0.237051]' '(0.079019-0.158035]' '(0.474098-0.553114]'
ampl7                        '(-inf-0.069957]' '(0.139911-0.209866]'     '(-inf-0.069957]' '(0.069957-0.139911]' '(0.069957-0.139911]' '(0.349775-0.419729]'
ampl8                        '(-inf-0.062418]' '(0.124833-0.187249]'     '(-inf-0.062418]' '(0.062418-0.124833]' '(0.062418-0.124833]' '(0.187249-0.249665]'
toosc                                 PulseOsc                SawOsc                SinOsc                SawOsc                SqrOsc              PulseOsc

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       1954 ( 20%)
1       4002 ( 40%)
2         47 (  0%)
3       2001 ( 20%)
4       2001 ( 20%)
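
Returning to the EM tables in EXAMPLES 1 and 2: no cell is ever 0, and each [total] is larger than the number of instances assigned to that cluster. The numbers are consistent with every cell starting at 1 before any instance is counted (a Laplace-style correction). This reading is an assumption inferred from the printed figures, and the helper below is hypothetical, not a Weka call.

```python
def smoothed_total(instances_in_cluster, num_values):
    """Column total if every attribute value starts with a count of 1."""
    return instances_in_cluster + num_values


# EXAMPLE 1, cluster 0 holds 4002 instances.
ampl_total = smoothed_total(4002, 10)   # ampl3..ampl8 each have 10 bins
toosc_total = smoothed_total(4002, 5)   # toosc has 5 values
print(ampl_total, toosc_total)          # prints: 4012 4007
```

Under this assumption a cell showing 1 means no instances, and a cell showing 2002 means 2001 instances, which agrees with the "Clustered Instances" counts.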