Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels

Danial Dervovic, Michael Cashmore

TL;DR

This work addresses the challenge of evaluating binary classifiers when test-time labels are incomplete, a setting in which discarding unlabeled samples biases the estimates, particularly when labels are Missing Not At Random (MNAR). It introduces Performance Estimation by Multiple Imputation (PEMI) and its Gaussian-approximation variant PEMI-Gauss, which produce a predictive distribution over common evaluation metrics (e.g., precision, recall, F1, ROC-AUC) by imputing missing labels with calibrated Bernoulli draws. The authors establish finite-sample bounds on convergence to normality for sums and ratios of Bernoulli variables, prove robustness to noise in the calibrator's output, and demonstrate strong empirical fidelity on real datasets under both Missing Completely At Random (MCAR) and MNAR settings, provided the calibrator is well-tuned. The approach yields actionable uncertainty quantification for model evaluation, which is useful for model monitoring and deployment in production systems where label feedback is delayed or incomplete. The results highlight the importance of calibration quality and suggest extending PEMI to more metrics, multi-class problems, and regression tasks.

Abstract

Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
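
The multiple-imputation idea described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function name `pemi_sketch`, its arguments, and the choice of precision and recall as the only metrics are assumptions made for the example. It assumes a calibrated probability `p_cal[i]` of the positive class is available for every sample and draws each missing label as a Bernoulli variable with that probability.

```python
import numpy as np

def pemi_sketch(y_obs, y_pred, p_cal, n_imputations=1000, seed=0):
    """Hypothetical sketch of metric estimation by multiple imputation.

    y_obs  : labels in {0, 1}, with np.nan where the label is missing
    y_pred : hard 0/1 predictions of the classifier under evaluation
    p_cal  : calibrated P(Y = 1 | x); only entries at missing rows are used
    Returns one precision and one recall value per imputation.
    """
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=int)
    p_cal = np.asarray(p_cal, dtype=float)
    missing = np.isnan(y_obs)

    precision = np.empty(n_imputations)
    recall = np.empty(n_imputations)
    for m in range(n_imputations):
        y = y_obs.copy()
        # Impute each missing label with an independent Bernoulli draw.
        y[missing] = rng.binomial(1, p_cal[missing])
        tp = np.sum((y == 1) & (y_pred == 1))
        fp = np.sum((y == 0) & (y_pred == 1))
        fn = np.sum((y == 1) & (y_pred == 0))
        # Metrics are undefined when their denominator is zero.
        precision[m] = tp / (tp + fp) if tp + fp > 0 else np.nan
        recall[m] = tp / (tp + fn) if tp + fn > 0 else np.nan
    return {"precision": precision, "recall": recall}

# Toy usage: two of six labels are missing.
draws = pemi_sketch(
    y_obs=[1, 0, np.nan, 1, np.nan, 0],
    y_pred=[1, 0, 1, 1, 0, 0],
    p_cal=[0.9, 0.1, 0.8, 0.7, 0.2, 0.3],
)
print(np.nanmean(draws["precision"]), np.nanstd(draws["precision"]))
```

The spread of the returned draws plays the role of the predictive distribution. A Gaussian summary in the spirit of PEMI-Gauss would replace the sampling loop with a closed-form mean and variance, which this sketch does not attempt.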

Paper Structure

This paper contains 49 sections, 15 theorems, 102 equations, 7 figures, 16 tables, 2 algorithms.

Key Result

Lemma 0

Let $Z = \sum_{i = 1}^n a_i Y_i$, where the $Y_i \sim \mathcal{B}(p_i)$ are mutually independent, $p_i \in (0, 1)$ and $a_i \in \mathbb{R}\setminus \{0\}$. Then $Z$ satisfies a Berry-Esseen-type bound on its distance from a Gaussian, in which $C_0 = 0.5600$ is a universal constant.
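
For orientation, the classical Berry-Esseen inequality for independent, non-identically distributed summands, specialised to this $Z$, gives $\sup_x \left| \Pr\!\left( (Z - \mathbb{E}[Z]) / \sqrt{\mathbb{V}[Z]} \le x \right) - \Phi(x) \right| \le C_0 \, \frac{\sum_i |a_i|^3 p_i (1 - p_i)\left(p_i^2 + (1 - p_i)^2\right)}{\left(\sum_i a_i^2 p_i (1 - p_i)\right)^{3/2}}$. This is the textbook form of the theorem and may be arranged differently from the paper's own statement. The sketch below evaluates the right-hand side numerically; the function name `berry_esseen_bound` is hypothetical.

```python
import numpy as np

C0 = 0.5600  # universal constant quoted in Lemma 0

def berry_esseen_bound(a, p, c0=C0):
    """Classical Berry-Esseen bound for Z = sum_i a_i * Y_i, Y_i ~ Bernoulli(p_i).

    Assumes the textbook non-i.i.d. form of the inequality; not taken
    verbatim from the paper.
    """
    a = np.asarray(a, dtype=float)
    p = np.asarray(p, dtype=float)
    # Variance of each summand a_i * Y_i.
    var = a**2 * p * (1 - p)
    # Third absolute central moment of each summand:
    # E|Y_i - p_i|^3 = p_i (1 - p_i) (p_i^2 + (1 - p_i)^2).
    rho = np.abs(a) ** 3 * p * (1 - p) * (p**2 + (1 - p) ** 2)
    return c0 * rho.sum() / var.sum() ** 1.5

# With unit weights and p_i = 0.5 the bound reduces to C0 / sqrt(n).
print(berry_esseen_bound(np.ones(100), np.full(100, 0.5)))
```

The bound shrinks at rate $n^{-1/2}$ when the weights and probabilities are well balanced, and loosens when a few terms dominate the variance or when the $p_i$ are close to 0 or 1.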

Figures (7)

  • Figure 1: Illustration of missingness mechanisms in relation to this work, with arrows indicating causation. $X \in \mathcal{X}$ are input features, $Y \in \mathcal{Y}$ is the label, $M \in \{0, 1\}$ is the missingness flag and $Y^* \in \mathcal{Y} \cup \{\texttt{NA}\}$ is the masked label.
  • Figure 2: Gap between optimistic and pessimistic bounds on classifier performance measures $\widehat{Q}_n^{(S)}$. Each plot (left-to-right) corresponds to a fraction of missing labels $p_m \in \langle 0.1, 0.2, 0.3 \rangle$.
  • Figure 3: Illustration of flexibility afforded by Theorem 3 (Robustness). For $\mathbb{V}[P] = 0.0009$, we show the allowed variation in a calibrator's output $P$ when the expectation $\mathbb{E}[P]$ is fixed to the given true Bernoulli probabilities $p \in \{0.1, 0.7, 0.95\}$.
  • Figure 4: Effect of varying MNAR class imbalance $\eta \in (0, 1)$ on fidelity using PEMI-Gauss and baseline for $p_{m} = 0.3$. Each column represents a metric estimator, with the top row showing $W_1$-distance and bottom row showing MAE. Error bars are bootstrapped confidence intervals at the $\alpha=0.9$ level. Fidelities are on $y$-axis (lower is better) with $\eta$ varying on the $x$-axis; $x$-axis values are slightly jittered to help separate each series visually.
  • Figure 5: Effect of varying MNAR class imbalance $\eta \in (0, 1)$ on evaluation metric predictive distribution quality using PEMI-Gauss for $p_{m} = 0.1$. Each plot represents a metric $\widehat{Q}_n^{(S)}$. Error bars are bootstrapped confidence intervals at the $\alpha=0.9$ level. Quality is on $y$-axis with $\eta$ varying on the $x$-axis, with $x$-axis values slightly jittered to help separate each series visually. Each line corresponds to a predictive distribution, with $p$ algorithms using PEMI-Gauss.
  • ...and 2 more figures
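
The optimistic and pessimistic bounds mentioned in the caption of Figure 2 can be illustrated for plain accuracy. This is a sketch under an assumption: the optimistic bound treats every missing label as agreeing with the prediction and the pessimistic bound treats every missing label as disagreeing. The paper's measures $\widehat{Q}_n^{(S)}$ may be bounded differently, and the function name `accuracy_bounds` is hypothetical.

```python
import numpy as np

def accuracy_bounds(y_obs, y_pred):
    """Worst-case and best-case accuracy when some labels are missing.

    y_obs  : labels in {0, 1}, with np.nan where the label is missing
    y_pred : hard 0/1 predictions
    Returns (pessimistic, optimistic).
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    missing = np.isnan(y_obs)
    n = y_obs.size
    correct_observed = np.sum(y_obs[~missing] == y_pred[~missing])
    # Pessimistic: every missing label contradicts the prediction.
    pessimistic = correct_observed / n
    # Optimistic: every missing label matches the prediction.
    optimistic = (correct_observed + missing.sum()) / n
    return pessimistic, optimistic

lo, hi = accuracy_bounds([1, 0, np.nan, 1, np.nan], [1, 1, 0, 1, 1])
print(lo, hi, hi - lo)
```

For accuracy the gap between the two bounds equals the fraction of missing labels, so the interval quickly becomes uninformative as that fraction grows.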

Theorems & Definitions (28)

  • Lemma 0
  • Theorem 1: Ratio of Correlated Gaussians
  • Theorem 2: Gaussian Approximation
  • Theorem 3: Robustness
  • Proposition 4
  • proof
  • Lemma 5
  • proof
  • Theorem 6: Berry-Esseen (Esseen, 1956; Shevtsova, 2010)
  • Lemma 6
  • ...and 18 more