Predicting Classification Accuracy When Adding New Unobserved Classes

Yuli Slavutsky, Yuval Benjamini

TL;DR

The paper addresses how to predict the final accuracy of marginal multiclass classifiers when the deployed class set grows beyond the observed sample. It introduces the reversed ROC (rROC) framework and proves that the expected accuracy on $k$ classes satisfies $\mathbb{E}_k[\mathcal{A}] = \mathbb{E}_x[C_x^{k-1}]$, where $C_x$ is the CDF of the incorrect-class scores of data point $x$ evaluated at its correct-class score. The average rROC curve determines the accuracy via $\mathbb{E}_k[\mathcal{A}] = 1 - (k-1)\int_0^1 (1-\overline{\text{rROC}}(1-u))\, u^{k-2}\, du$, and the area under it, the rAUC, equals $\mathbb{E}_2[\mathcal{A}]$. The authors then develop CleaneX, a neural-network-based estimator that learns $\hat{C}_x$ from the observed scores on $k_1$ classes and calibrates its predictions against the observed accuracies, in order to predict $\mathbb{E}_{k_2}[\mathcal{A}]$ for any $k_2>k_1$. In simulations and in experiments on CIFAR-100, LFW, and brain-decoding data, CleaneX consistently achieves lower RMSE and fewer large errors than KDE and non-parametric regression, enabling reliable extrapolation to very large class sets. This provides a practical tool for early assessment and data-collection planning in large-scale multiclass systems where the full class repertoire is unknown at training time.
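As a rough illustration of the identity $\mathbb{E}_k[\mathcal{A}] = \mathbb{E}_x[C_x^{k-1}]$, the sketch below computes an empirical $C_x$ from a matrix of classification scores and plugs it into the formula. It is a naive plug-in estimate, not the CleaneX estimator. The function names `empirical_cx` and `plugin_accuracy`, the `(n, k1)` score layout, and the toy numbers are all assumptions made for this example.

```python
import numpy as np

def empirical_cx(scores, labels):
    """Empirical C_x for each data point.

    scores: (n, k1) array, scores[i, j] is the score of class j for point i.
    labels: (n,) integer array with the index of the correct class.
    Returns the fraction of incorrect classes scoring strictly below
    the correct class (hypothetical helper, not from the paper).
    """
    n, k1 = scores.shape
    correct = scores[np.arange(n), labels]
    # Strict comparison, so the correct class never counts against itself.
    below = (scores < correct[:, None]).sum(axis=1)
    return below / (k1 - 1)

def plugin_accuracy(cx, k):
    """Plug-in estimate of E_k[A] as the mean of C_x^(k-1)."""
    return float(np.mean(np.asarray(cx, dtype=float) ** (k - 1)))

# Toy data: 3 points scored against k1 = 3 classes.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.5, 0.4],
                   [0.2, 0.8, 0.1]])
labels = np.array([0, 2, 1])

cx = empirical_cx(scores, labels)          # [1.0, 0.5, 1.0]
acc_k2 = plugin_accuracy(cx, 2)            # mean of C_x
acc_k3 = plugin_accuracy(cx, 3)            # mean of C_x^2 = 0.75
```

Because the empirical $C_x$ can take only $k_1$ distinct values, this plug-in is unlikely to be reliable for $k$ much larger than $k_1$, which is the regime a learned $\hat{C}_x$ is meant to handle.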

Abstract

Multiclass classifiers are often designed and evaluated only on a sample from the classes on which they will eventually be applied. Hence, their final accuracy remains unknown. In this work we study how a classifier's performance over the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes. For this, we define a measure of separation between correct and incorrect classes that is independent of the number of classes: the "reversed ROC" (rROC), which is obtained by replacing the roles of classes and data-points in the common ROC. We show that the classification accuracy is a function of the rROC in multiclass classifiers, for which the learned representation of data from the initial class sample remains unchanged when new classes are added. Using these results we formulate a robust neural-network-based algorithm, "CleaneX", which learns to estimate the accuracy of such classifiers on arbitrarily large sets of classes. Unlike previous methods, our method uses both the observed accuracies of the classifier and densities of classification scores, and therefore achieves remarkably better predictions than current state-of-the-art methods on both simulations and real datasets of object detection, face recognition, and brain decoding.

Paper Structure

This paper contains 13 sections, 2 theorems, 20 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

The expected balanced classification accuracy at $k$ classes is $\mathbb{E}_k[\mathcal{A}] = \mathbb{E}_x[C_x^{k-1}]$.
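The identity stated in the TL;DR, $\mathbb{E}_k[\mathcal{A}] = \mathbb{E}_x[C_x^{k-1}]$, can be motivated by a short argument. The sketch below is not the paper's proof. It assumes that the $k-1$ incorrect classes are drawn independently, that ties between scores have probability zero, and that $F$ (a symbol introduced here) denotes the CDF of $C_x$ over data points.

```latex
% Fix a data point x whose correct class scores s. Each independently drawn
% incorrect class scores below s with probability C_x, and x is classified
% correctly exactly when all k-1 incorrect classes score below s.
\begin{align*}
\Pr(x \text{ classified correctly among } k \text{ classes} \mid x)
  &= C_x^{k-1}, \\
\mathbb{E}_k[\mathcal{A}]
  &= \mathbb{E}_x\!\left[C_x^{k-1}\right]
   = (k-1)\int_0^1 u^{k-2}\,\Pr(C_x > u)\,du \\
  &= 1 - (k-1)\int_0^1 F(u)\,u^{k-2}\,du .
\end{align*}
```

Comparing the last line with the rROC expression in the TL;DR suggests $F(u) = 1 - \overline{\text{rROC}}(1-u)$. Setting $k=2$ gives $\mathbb{E}_2[\mathcal{A}] = \mathbb{E}_x[C_x]$, the quantity the TL;DR identifies with the rAUC.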

Figures (6)

  • Figure 1: The reversed ROC. The leftmost column shows an example of the score distributions of four data points: the distribution of scores of incorrect classes (in red), and the score of the correct class (in green). The yellow shaded area is the CDF of the incorrect scores distribution evaluated at the correct score, that is $C_{x}$. The second column shows the corresponding $\text{rTPR}$ (green, top) and $\text{rFPR}$ (red, bottom). The third column depicts the resulting $\text{rROC}_x$ curves. The rightmost plot presents the average $\text{rROC}$ over the four data points (solid grey); as the number of averaged data points grows, the $\overline{\text{rROC}}$ curve becomes smoother (dotted-blue).
  • Figure 2: Simulation results. For each scenario we show a boxplot of the RMSE values obtained over 50 repetitions using CleaneX (left box, orange), the regression-based method (middle box, blue) and KDE (right box, purple). The boxes extend from the lower to the upper quartile values, with a line at the median; whiskers show values at a distance of at most 1.5 IQR (interquartile range) from the lower and the upper quartiles; outliers are not shown. For $k_1=500$ the KDE-based method yields RMSE values higher than 0.05, which are therefore not shown.
  • Figure 3: Experimental results. A, B, C: accuracy curves of the three datasets as predicted by CleaneX, regression and KDE, respectively; dotted vertical lines denote $k_1$, grey curves correspond to $\bar{\mathcal{A}}^{k_1}_k$ at each repetition, black curves correspond to $\mathbb{E}_k[{\mathcal{A}}]$ for $2 \leq k \leq k_2$; average RMSE taken over all 50 repetitions. D: distribution of RMSE values over the 50 repetitions. E: distribution of the ratio between RMSE values of regression/KDE and CleaneX -- values above 1 (orange dotted line) indicate that CleaneX outperforms competing methods (charts capped at 4). In D and E, boxes show the lower quartile, upper quartile and median; whiskers show values within 1.5 IQR of the box; outliers are not shown.
  • Figure 4: Comparison of predicted accuracy curves produced by CleaneX (left, orange), the regression-based method (middle, blue) and KDE (right, purple), on the eight simulated datasets with $d=5$ and $k_1=100$ (dotted vertical line). The curves of $\bar{\mathcal{A}}^{k_1}_k$ for each repetition are shown in grey. The black curves correspond to $\mathbb{E}_k[{\mathcal{A}}]$ for $2 \leq k \leq k_2=2000$.
  • Figure 5: Comparison of predicted accuracy curves produced by CleaneX (left, orange), the regression-based method (middle, blue) and KDE (right, purple), on the eight simulated datasets with $d=5$ and $k_1=500$ (dotted vertical line). The curves of $\bar{\mathcal{A}}^{k_1}_k$ for each repetition are shown in grey. The black curves correspond to $\mathbb{E}_k[{\mathcal{A}}]$ for $2 \leq k \leq k_2=2000$.
  • ...and 1 more figure
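The construction described in the Figure 1 caption can be sketched numerically. The example assumes that each per-point curve $\text{rROC}_x$ is a step function that jumps from 0 to 1 at $\text{rFPR} = 1 - C_x$, which is one reading of the caption rather than the paper's formal definition. The function name `average_rroc` and the four $C_x$ values are made up for illustration.

```python
import numpy as np

def average_rroc(cx, grid):
    """Average of per-point step curves rROC_x(u) = 1[u >= 1 - C_x].

    cx:   (n,) array of C_x values in [0, 1].
    grid: (G,) array of rFPR values u in [0, 1].
    Returns a (G,) array with the averaged curve (hypothetical helper).
    """
    cx = np.asarray(cx, dtype=float)
    steps = grid[None, :] >= (1.0 - cx)[:, None]   # shape (n, G)
    return steps.mean(axis=0)

# Four data points, mirroring the four examples in the figure (values assumed).
cx = np.array([1.0, 0.5, 1.0, 0.25])
grid = np.linspace(0.0, 1.0, 100001)
curve = average_rroc(cx, grid)

# Area under the averaged curve via the trapezoid rule.
rauc = float(np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(grid)))
```

Under this reading, the area under the averaged curve matches the mean of $C_x$ up to discretization error, consistent with the statement that the rAUC equals $\mathbb{E}_2[\mathcal{A}]$. Averaging over more data points replaces the few large steps with many small ones, which is why the curve in the figure becomes smoother.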

Theorems & Definitions (6)

  • Definition 1
  • Remark
  • Theorem 1
  • Proposition 1
  • proof
  • proof