CSC 558 - Kappa Statistic
Weka's implementation derives Kappa from the confusion matrix. In some cases it gives slightly different results than a trivial application of the formula below. The paper “Understanding Interobserver Agreement: The Kappa Statistic” covers the statistic in depth.
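A minimal sketch (plain Python, not Weka's actual source) of how Kappa can be computed from a confusion matrix: observed accuracy is the diagonal total over the grand total, and expected accuracy is the chance agreement implied by the row and column totals. The function name and the example matrix are made up for illustration.

# Sketch: Cohen's kappa from a confusion matrix (illustrative, not Weka code).
def kappa_from_confusion_matrix(cm):
    """cm is a square list of lists: cm[actual][predicted] = count."""
    total = sum(sum(row) for row in cm)
    # Observed accuracy: fraction of instances on the diagonal.
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    # Expected (chance) accuracy: for each class, the probability that
    # actual and predicted both land on that class by chance.
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1 - expected)

# Hypothetical 2-class confusion matrix:
cm = [[50, 10],   # actual class a: 50 predicted a, 10 predicted b
      [ 5, 35]]   # actual class b:  5 predicted a, 35 predicted b
print(kappa_from_confusion_matrix(cm))   # observed .85, expected .51, kappa ~ 0.694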
From https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english:
“The Kappa statistic (or value) is a metric that compares an Observed
Accuracy with an Expected Accuracy (random chance). The kappa statistic
is used not only to evaluate a single classifier, but also to evaluate
classifiers amongst themselves. In addition, it takes into account
random chance (agreement with a random classifier), which generally
means it is less misleading than simply using accuracy as a metric (an
Observed Accuracy of 80% is a lot less impressive with an Expected
Accuracy of 75% versus an Expected Accuracy of 50%).
Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
Not only can this kappa statistic shed light into how the classifier
itself performed, the kappa statistic for one model is directly
comparable to the kappa statistic for any other model used for the same
classification task.”
Parson’s example: If you had a 6-sided die that had the value 1 on 5
sides and 0 on the other, the random-chance expected accuracy of
rolling a 1 would be 5/6 = 83.3%. Since the ZeroR classifier simply
picks the most statistically likely class without respect to the other
(non-target) attributes, it would always predict a die value of 1 in
this case, giving an observed accuracy of 83.3% and a Kappa of
(.833 - .833) / (1 - .833) = 0.
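A quick arithmetic check of the die example (plain Python; the variable names are only illustrative). Since ZeroR predicts 1 on every roll, the chance agreement is 1.0 * 5/6 for class 1 plus 0.0 * 1/6 for class 0, which equals the observed accuracy:

observed = 5 / 6    # ZeroR predicts 1 every roll; 5 of 6 rolls actually are 1
expected = 5 / 6    # chance agreement: predicted-1 fraction (1.0) * actual-1 fraction (5/6)
kappa = (observed - expected) / (1 - expected)
print(kappa)        # 0.0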
“Landis and Koch considers 0-0.20 as slight, 0.21-0.40
as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1
as almost perfect. Fleiss considers kappas > 0.75 as excellent,
0.40-0.75 as fair to good, and < 0.40 as poor. It is important to
note that both scales are somewhat arbitrary.
At least two further considerations should be taken into account when interpreting the kappa statistic. First, the kappa statistic should always be compared with an accompanied confusion matrix if possible to obtain the most accurate interpretation. Second, acceptable kappa statistic values vary on the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional.”
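The two quoted scales can be written down as a small lookup; the thresholds below come directly from the quote, and the function names are just illustrative.

def landis_koch_label(kappa):
    """Interpretation per the Landis and Koch scale quoted above."""
    if kappa <= 0.20:
        return "slight"
    elif kappa <= 0.40:
        return "fair"
    elif kappa <= 0.60:
        return "moderate"
    elif kappa <= 0.80:
        return "substantial"
    else:
        return "almost perfect"

def fleiss_label(kappa):
    """Interpretation per the Fleiss scale quoted above."""
    if kappa < 0.40:
        return "poor"
    elif kappa <= 0.75:
        return "fair to good"
    else:
        return "excellent"

print(landis_koch_label(0.69), fleiss_label(0.69))   # substantial, fair to good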