Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels
Danial Dervovic, Michael Cashmore
TL;DR
This work addresses the challenge of evaluating binary classifiers when test-time labels are incomplete, a setting in which naive evaluation is biased when labels are Missing Not At Random (MNAR). It introduces Performance Estimation by Multiple Imputation (PEMI) and its Gaussian-approximation variant, PEMI-Gauss, which produce a predictive distribution over common evaluation metrics (e.g., precision, recall, F1, ROC-AUC) by imputing missing labels with calibrated Bernoulli draws. The authors establish finite-sample bounds on convergence to normality for sums and ratios of Bernoulli variables, prove robustness to calibrated noise in the imputations, and demonstrate strong empirical fidelity on real datasets in both Missing Completely At Random (MCAR) and MNAR settings, provided the calibrator is well-tuned. The approach yields actionable uncertainty quantification for model evaluation, which is useful for model monitoring and deployment in production systems where label feedback is delayed or incomplete. The results highlight the importance of calibration quality and suggest directions for extending PEMI to more metrics, multi-class problems, and regression tasks.
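The imputation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pemi_sketch`, its arguments, and the use of `np.nan` to mark missing labels are assumptions made for illustration, and the calibrated probabilities `p_cal` are taken as given rather than fitted. Each draw fills the missing labels with Bernoulli samples and recomputes precision and recall, so the collection of draws approximates a predictive distribution over each metric.

```python
import numpy as np

def pemi_sketch(y_pred, y_obs, p_cal, n_draws=1000, seed=0):
    """Hypothetical helper: multiple imputation of missing binary labels.

    y_pred : 0/1 predicted labels
    y_obs  : 0/1 true labels, with np.nan where the label is missing
    p_cal  : calibrated estimates of P(y = 1), assumed given
    Returns arrays of per-draw precision and recall.
    """
    rng = np.random.default_rng(seed)
    y_pred = np.asarray(y_pred)
    y_obs = np.asarray(y_obs, dtype=float)
    p_cal = np.asarray(p_cal, dtype=float)
    missing = np.isnan(y_obs)

    precisions = np.empty(n_draws)
    recalls = np.empty(n_draws)
    for k in range(n_draws):
        y = y_obs.copy()
        # Impute each missing label with an independent Bernoulli(p_cal) draw.
        y[missing] = rng.random(missing.sum()) < p_cal[missing]
        tp = np.sum((y_pred == 1) & (y == 1))
        fp = np.sum((y_pred == 1) & (y == 0))
        fn = np.sum((y_pred == 0) & (y == 1))
        # Metrics are undefined when their denominator is zero.
        precisions[k] = tp / (tp + fp) if tp + fp > 0 else np.nan
        recalls[k] = tp / (tp + fn) if tp + fn > 0 else np.nan
    return precisions, recalls
```

Summaries such as `np.nanmean(precisions)` and `np.nanpercentile(precisions, [2.5, 97.5])` then give a point estimate and an interval. How well these track the truth depends on the quality of `p_cal`, in line with the caveat about calibration above.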
Abstract
Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Discarding samples with missing labels, a common solution, can introduce bias, especially when the labels are Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method offers not only point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Finally, we present a robustness proof confirming the validity of the approximation under a realistic error model.
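To make the Gaussian claim concrete for the simplest case, the sketch below moment-matches a normal distribution to a sum of independent Bernoulli variables, such as the number of positives among the imputed labels. It is an illustration only: the function name `bernoulli_sum_gaussian` is hypothetical, plain moment matching is an assumption rather than the paper's stated construction, and the ratio metrics and finite-sample bounds treated in the paper are not reproduced here.

```python
import numpy as np

def bernoulli_sum_gaussian(p, z=1.96):
    """Hypothetical helper: normal approximation to S = sum_i Bernoulli(p_i).

    For independent draws, E[S] = sum(p_i) and Var[S] = sum(p_i * (1 - p_i)).
    Returns the mean, the variance, and an approximate central interval
    (95% for the default z = 1.96).
    """
    p = np.asarray(p, dtype=float)
    mean = p.sum()
    var = (p * (1.0 - p)).sum()
    half_width = z * np.sqrt(var)
    return mean, var, (mean - half_width, mean + half_width)

# Usage: compare the approximation against simulated Bernoulli sums.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=500)           # illustrative probabilities
mean, var, interval = bernoulli_sum_gaussian(p)
sims = (rng.random((2000, p.size)) < p).sum(axis=1)
coverage = np.mean((sims >= interval[0]) & (sims <= interval[1]))
```

With several hundred terms, `coverage` would be expected to sit near 0.95. With few terms, or with probabilities close to 0 or 1, the approximation degrades, which is the regime where finite-sample bounds matter.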
