CSC 558 - Data Mining and Predictive Analytics
Kappa Statistic

Dr. Dale E. Parson

Weka's implementation derives Kappa from the confusion matrix.
    In some cases it gives a slightly different result than a trivial application of the formula below.
    Here is a Python implementation of the same algorithm that gives the same result as Weka.
        See my analysis of this Kappa algorithm.
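The linked implementation is not reproduced here; the sketch below is only an illustrative version of the same idea (the function name and example counts are mine): Kappa is computed from a square confusion matrix, with the row and column marginal totals supplying the expected (chance) agreement, which is the confusion-matrix-based derivation described above.

    def kappa_from_confusion_matrix(matrix):
        """Cohen's Kappa from a square confusion matrix (rows = actual,
        columns = predicted), using marginal totals for chance agreement."""
        n = len(matrix)
        total = float(sum(sum(row) for row in matrix))
        row_totals = [sum(matrix[i]) for i in range(n)]
        col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
        observed = sum(matrix[i][i] for i in range(n)) / total
        expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / (total * total)
        if expected >= 1.0:        # degenerate case: all weight in one class
            return 1.0
        return (observed - expected) / (1.0 - expected)

    # Illustrative 2x2 example: observed accuracy 0.85, expected 0.50, Kappa 0.70.
    print(kappa_from_confusion_matrix([[45, 5],
                                       [10, 40]]))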
       
Here is a paper summarizing Kappa, including this code in Table 1.

From https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english:

“The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves. In addition, it takes into account random chance (agreement with a random classifier), which generally means it is less misleading than simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of 75% versus an Expected Accuracy of 50%).
Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)
Not only can this kappa statistic shed light into how the classifier itself performed, the kappa statistic for one model is directly comparable to the kappa statistic for any other model used for the same classification task.”

Parson’s example: If you had a 6-sided die that had the value 1 on 5 sides and 0 on the other, the random-chance expected accuracy of rolling a 1 would be 5/6 = 83.3%. Since the ZeroR classifier simply picks the most statistically likely class without regard to the other (non-target) attributes, it would predict a die value of 1 in this case, giving an observed accuracy of 83.3% and a Kappa of (.833 - .833) / (1 - .833) = 0.
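As a quick check of the arithmetic in the die example (the variable names below are just for illustration):

    # ZeroR always predicts the majority class, here the die value 1.
    expected_accuracy = 5.0 / 6.0      # chance agreement, about 0.833
    observed_accuracy = 5.0 / 6.0      # ZeroR is right exactly as often as chance
    kappa = (observed_accuracy - expected_accuracy) / (1.0 - expected_accuracy)
    print(kappa)                       # 0.0 -- no improvement over chance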

Also from this linked site: “Landis and Koch considers 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect. Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor. It is important to note that both scales are somewhat arbitrary.

At least two further considerations should be taken into account when interpreting the kappa statistic. First, the kappa statistic should always be compared with an accompanied confusion matrix if possible to obtain the most accurate interpretation. Second, acceptable kappa statistic values vary on the context. For instance, in many inter-rater reliability studies with easily observable behaviors, kappa statistic values below 0.70 might be considered low. However, in studies using machine learning to explore unobservable phenomena like cognitive states such as day dreaming, kappa statistic values above 0.40 might be considered exceptional.”
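One way to make the quoted Landis and Koch scale concrete is a small lookup function. The thresholds below are taken directly from the quoted ranges; the handling of negative Kappa values (labeled "poor" here) is an assumption, since the quote does not cover them.

    def landis_koch_label(kappa):
        """Map a Kappa value to Landis and Koch's qualitative scale."""
        if kappa < 0.0:
            return "poor"            # assumption: below-chance agreement
        elif kappa <= 0.20:
            return "slight"
        elif kappa <= 0.40:
            return "fair"
        elif kappa <= 0.60:
            return "moderate"
        elif kappa <= 0.80:
            return "substantial"
        else:
            return "almost perfect"

    print(landis_koch_label(0.70))   # "substantial"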