
Limits to classification performance by relating Kullback-Leibler divergence to Cohen's Kappa

L. Crow, S. J. Watts

TL;DR

Four very different real datasets - Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification - with both continuous and discrete values are analysed. Comparing their classification performance with the expected theoretical limit shows that the algorithms could not have performed any better.

Abstract

The performance of machine learning classification algorithms is evaluated by estimating metrics, often from the confusion matrix, using training data and cross-validation. However, these metrics do not prove that the best possible performance has been achieved. Fundamental limits to error rates can be estimated using information distance measures. To this end, the confusion matrix has been formulated to comply with the Chernoff-Stein Lemma. This links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance, which is the parallel resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the Kullback-Leibler divergences. The classification algorithm gives the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets - Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification - are analysed, with both continuous and discrete values, and their classification performance is compared to the expected theoretical limit. In all cases this analysis shows that the algorithms could not have performed any better, given the underlying probability density functions for the two classes. Important lessons are learnt on how to predict the performance of algorithms for imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful, but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.
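
As a minimal sketch of the two quantities the abstract relates, the Python below computes the Resistor Average Distance as the parallel combination $1/R = 1/D(P\parallel Q) + 1/D(Q\parallel P)$ and Cohen's Kappa from a two-class confusion matrix. The function names, the use of closed-form Gaussian divergences, and the row/column layout of the matrix are assumptions made for illustration. The relation between Kappa and $R$ itself is derived in the main text and is not reproduced here.

```python
import math


def kl_gauss_bits(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D(P||Q) in bits for two 1D Gaussians (illustrative model)."""
    nats = (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)
    return nats / math.log(2)  # convert nats to bits


def resistor_average(d_pq, d_qp):
    """Parallel-resistor combination: 1/R = 1/D(P||Q) + 1/D(Q||P)."""
    if d_pq == 0 or d_qp == 0:
        return 0.0  # identical distributions: no separation
    return d_pq * d_qp / (d_pq + d_qp)


def cohens_kappa(cm):
    """Cohen's Kappa from a 2x2 matrix of counts.

    Assumed layout: rows are true classes, columns are predicted classes.
    """
    n = sum(sum(row) for row in cm)
    p_observed = (cm[0][0] + cm[1][1]) / n
    row_frac = [sum(row) / n for row in cm]
    col_frac = [(cm[0][j] + cm[1][j]) / n for j in range(2)]
    p_chance = row_frac[0] * col_frac[0] + row_frac[1] * col_frac[1]
    return (p_observed - p_chance) / (1 - p_chance)
```

For two unit-variance Gaussians with means 0 and 1, both divergences equal 0.5 nats (about 0.721 bits), so `resistor_average` returns half of that, about 0.361 bits. Like two resistors in parallel, $R$ is never larger than the smaller of the two divergences.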

Paper Structure (21 sections, 36 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: An update of a figure used in ref. Johnson, showing the relationships between key information-theoretic distances. The curves are for the 1D exponential model described in Section 4.1 and show the Renyi Divergences, $D_{t}(P\parallel Q)$ and $D_{1-t}(Q\parallel P)$; the Chernoff Divergence, $C_{t}(P\parallel Q)$; and the Kullback-Leibler Divergences, $D(P\parallel Q)$ and $D(Q\parallel P)$. The Resistor Average Distance, $R(P,Q)$ (Johnson), is at $t=t_{R}$ and the Chernoff Information, $C(P,Q)$, is at $t=t_{C}$. $R(P,Q)$ is the value at which the two double lines meet. These lines are tangential to the Chernoff Divergence at $t=0$ and $t=1$. The Renyi Divergence at $t=1/2$, $D_{1/2}(P\parallel Q)=D_{1/2}(Q\parallel P)$, has a value close to $R(P,Q)$. This is not an accident, as the main text explains. Not shown in the figure, but for reference, the Bhattacharyya Distance, $B(P,Q)$, is the value of the Chernoff Divergence at $t=1/2$, which is $\frac{1}{2}D_{1/2}(P\parallel Q)$. A second-order approximation to the Renyi divergence is used to estimate the Chernoff Divergence, which is shown in the figure. See Section 3.1 for details.
  • Figure 2: Methodology to compare the performance of a classification algorithm with the expectation from information distance measures. For full details, see the main text: Box 1, Section 2; Box 2, Sections 3.1 and 3.3; Box 3, Section 3.2; Box 4, Sections 3.2 and 3.3.
  • Figure 3: Definition of the two class confusion matrix. The arrows indicate how entries "leak" from their true class (T) to the wrong class (L).
  • Figure 4: a) Relationships between the parameters $K$, $K_{12}$ and $K_{21}$, and the average, $\frac{1}{2}(K_{12}+K_{21})$, for the Gaussian Model. The slope of the line is 1.0 for all. b) The same for the Exponential Model. The slope of the line for the average is 1.0. This is for balanced data, $f_{1}=f_{2}=0.5$. The models are described in Section 4.1.
  • Figure 5: kNN estimates of the Kullback-Leibler Divergences, $CDI(1,2)$ and $CDI(2,1)$, and the Resistor Average Distance, $CDR$, for both the Gaussian and Exponential models. See Sections 3.3 and 4.2 for details. Double dashed lines are the theoretical predictions for the Kullback-Leibler divergences. Single dashed lines are the theoretical predictions for the Resistor Average Distance.
  • ...and 6 more figures
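
The kNN estimates named in the Figure 5 caption can be illustrated with a common nearest-neighbour Kullback-Leibler estimator. The sketch below is an assumption, not necessarily the estimator used in the paper: it compares, for each sample from class 1, the distance to its $k$-th nearest neighbour within class 1 against the distance to its $k$-th nearest neighbour in class 2. The names `kth_nn_distance` and `knn_kl_bits` are hypothetical, the search is brute force for clarity, and the data are assumed continuous with no tied points.

```python
import math
import random


def kth_nn_distance(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `others`."""
    return sorted(math.dist(point, other) for other in others)[k - 1]


def knn_kl_bits(x, y, k=1):
    """kNN estimate of D(P||Q) in bits from samples x ~ P and y ~ Q.

    x and y are lists of equal-length tuples. Assumes continuous data:
    duplicate points would give a zero distance and break the logarithm.
    """
    n, m, dim = len(x), len(y), len(x[0])
    total = 0.0
    for i, xi in enumerate(x):
        rho = kth_nn_distance(xi, x[:i] + x[i + 1:], k)  # within class, xi excluded
        nu = kth_nn_distance(xi, y, k)                    # across classes
        total += math.log(nu / rho)
    nats = dim * total / n + math.log(m / (n - 1))
    return nats / math.log(2)  # convert nats to bits
```

Calling `knn_kl_bits(x, y)` and `knn_kl_bits(y, x)` gives the two directed estimates, which can then be combined in parallel as `d12 * d21 / (d12 + d21)` to obtain an estimate of the Resistor Average Distance. A larger `k` trades some bias for lower variance, and a tree-based neighbour search would be preferable to this brute-force loop for large samples.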