CSC 558 - Data Mining and Predictive Analytics
Kappa Statistic
Dr. Dale E. Parson
Weka's implementation derives Kappa from the confusion matrix. In some
cases it gives a slightly different result than a trivial application
of the formula below. A Python implementation of the same algorithm
gives the same result as Weka; see my analysis of this Kappa
algorithm, and a paper summarizing Kappa that includes this code in
Table 1.
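That implementation is not reproduced here. The sketch below is a
minimal, standard way to derive Cohen's Kappa from a confusion matrix,
matching the formula quoted later in this handout (observed accuracy
from the diagonal, expected accuracy from the row and column totals);
the function name kappa_from_confusion_matrix is illustrative rather
than taken from the linked code.

def kappa_from_confusion_matrix(matrix):
    # matrix[i][j] = number of instances whose actual class is i and
    # whose predicted class is j.
    n_classes = len(matrix)
    total = float(sum(sum(row) for row in matrix))
    # Observed accuracy: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n_classes)) / total
    # Expected (chance) accuracy: agreement implied by the row (actual)
    # and column (predicted) marginal totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1.0 - expected)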
From https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english:
“The Kappa statistic (or value) is a metric that compares an
Observed Accuracy with an Expected Accuracy (random chance). The
kappa statistic is used not only to evaluate a single
classifier, but also to evaluate classifiers amongst themselves.
In addition, it takes into account random chance (agreement with
a random classifier), which generally means it is less
misleading than simply using accuracy as a metric (an Observed
Accuracy of 80% is a lot less impressive with an Expected
Accuracy of 75% versus an Expected Accuracy of 50%).
Kappa = (observed accuracy - expected accuracy)/(1 - expected
accuracy)
Not only can this kappa statistic shed light into how the
classifier itself performed, the kappa statistic for one model
is directly comparable to the kappa statistic for any other
model used for the same classification task.”

Parson's example: If you had a 6-sided die that had the value 1 on 5
sides, and 0 on the other, the random-chance expected accuracy of
rolling a 1 would be 5/6 = 83.3%. Since the ZeroR classifier simply
picks the most statistically likely class without regard to the other
(non-target) attributes, it would pick an expected die value of 1 in
this case, giving an observed accuracy of 83.3% that exactly matches
the expected accuracy, and a Kappa of (.833 - .833) / (1 - .833) = 0.
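As a quick check of that arithmetic, here is an illustrative
computation (not from the course materials) using the die example's
confusion matrix, with rows for the actual die value (1, then 0) and
columns for ZeroR's prediction (always 1):

# Die example: 5 of 6 rolls are actually 1, ZeroR predicts 1 every time.
matrix = [[5, 0],   # actual value 1: 5 predicted as 1, 0 predicted as 0
          [1, 0]]   # actual value 0: 1 predicted as 1, 0 predicted as 0
total = 6.0
observed = (matrix[0][0] + matrix[1][1]) / total    # 5/6 = 0.833
expected = (6 * 5 + 0 * 1) / (total * total)        # column-by-row totals: 5/6 = 0.833
kappa = (observed - expected) / (1.0 - expected)
print(kappa)                                        # 0.0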
Also from this linked site: “Landis and Koch considers
0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate,
0.61-0.80 as substantial, and 0.81-1 as almost perfect.
Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as
fair to good, and < 0.40 as poor. It is important to note
that both scales are somewhat arbitrary.
At least two further considerations should be taken into account
when interpreting the kappa statistic. First, the kappa
statistic should always be compared with an accompanied
confusion matrix if possible to obtain the most accurate
interpretation. Second, acceptable kappa statistic values vary
on the context. For instance, in many inter-rater reliability
studies with easily observable behaviors, kappa statistic values
below 0.70 might be considered low. However, in studies using
machine learning to explore unobservable phenomena like
cognitive states such as day dreaming, kappa statistic values
above 0.40 might be considered exceptional.”
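As a small illustrative helper (not part of the quoted material or the
course code), the Landis and Koch descriptors above can be written as
a lookup:

def landis_koch_label(kappa):
    # Cut-points taken from the Landis and Koch scale quoted above;
    # negative Kappa values are not covered by that quote.
    if kappa < 0:
        return "not covered by the quoted scale"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.0))   # "slight", as for the ZeroR die example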