Estimating Model Performance Under Covariate Shift Without Labels
Jakub Białek, Juhani Kivimäki, Wojtek Kuberski, Nikolaos Perrakis
TL;DR
PAPE tackles the problem of estimating binary classifier performance after deployment, when labels are unavailable and the covariate distribution shifts. It combines density-ratio estimation, which aligns the source and target distributions, with a calibrated score transformation. This enables CBPE-style estimation for any metric derived from the confusion matrix and provides uncertainty bounds under approximate calibration. The method extends prior unsupervised estimators beyond accuracy, demonstrates strong empirical gains across 900 dataset-model combinations and 36k data chunks, and discusses practical limitations and future extensions. The work offers a scalable, non-invasive monitoring approach that can help quantify business impact and guide retraining or downstream process adjustments without requiring labeled production data.
Abstract
After deployment, machine learning models often experience performance degradation due to shifts in data distribution. It is challenging to assess post-deployment performance accurately when labels are missing or delayed. Existing proxy methods, such as data drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce Probabilistic Adaptive Performance Estimation (PAPE), a new method for evaluating binary classification models on unlabeled tabular data that accurately estimates model performance under covariate shift. It can be applied to any performance metric defined in terms of the elements of the confusion matrix. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates. It makes no assumptions about the nature of the covariate shift and instead learns it directly from data. We tested PAPE on over 900 dataset-model combinations from US census data, comparing it against several benchmarks on various metrics. Our findings show that PAPE outperforms the other methods, making it a strong choice for estimating the performance of binary classification models.
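The estimation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes density-ratio weights for the labeled reference data have already been estimated by some separate procedure, uses weighted histogram binning as a stand-in calibrator, and assumes scores lie in [0, 1]. All function names are hypothetical. The last two functions show the confidence-based step: once probabilities are calibrated for the shifted data, the expected confusion matrix follows from the probabilities and the model's predictions alone, and metrics are derived from it.

```python
from typing import Dict, List, Sequence


def fit_weighted_bins(scores: Sequence[float], labels: Sequence[int],
                      weights: Sequence[float], n_bins: int = 10) -> List[float]:
    """Fit a weighted histogram-binning calibrator on labeled reference data.

    `weights` are assumed to be density-ratio estimates p_target(x) / p_source(x),
    so each bin's positive rate reflects the shifted (target) distribution.
    """
    pos = [0.0] * n_bins
    tot = [0.0] * n_bins
    for s, y, w in zip(scores, labels, weights):
        b = min(int(s * n_bins), n_bins - 1)  # clamp a score of 1.0 into the last bin
        pos[b] += w * y
        tot[b] += w
    # Bins that received no weight fall back to their midpoint (an arbitrary choice).
    return [pos[b] / tot[b] if tot[b] > 0 else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(scores: Sequence[float], bin_rates: Sequence[float]) -> List[float]:
    """Map raw scores to calibrated probabilities using the fitted bin rates."""
    n_bins = len(bin_rates)
    return [bin_rates[min(int(s * n_bins), n_bins - 1)] for s in scores]


def expected_confusion(cal_probs: Sequence[float],
                       preds: Sequence[int]) -> Dict[str, float]:
    """Expected confusion-matrix counts from calibrated probabilities, no labels."""
    cm = {"tp": 0.0, "fp": 0.0, "fn": 0.0, "tn": 0.0}
    for p, y_hat in zip(cal_probs, preds):
        if y_hat == 1:
            cm["tp"] += p        # predicted positive, is positive with probability p
            cm["fp"] += 1.0 - p
        else:
            cm["fn"] += p        # predicted negative, is positive with probability p
            cm["tn"] += 1.0 - p
    return cm


def estimate_metrics(cm: Dict[str, float]) -> Dict[str, float]:
    """Plug expected counts into metric formulas (ratios of expectations)."""
    n = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    pred_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / n if n > 0 else float("nan"),
        "precision": cm["tp"] / pred_pos if pred_pos > 0 else float("nan"),
        "recall": cm["tp"] / actual_pos if actual_pos > 0 else float("nan"),
    }
```

In use, one would fit the bins on reference data with its labels and weights, call `calibrate` on the unlabeled production scores, and pass the result with the production predictions to `expected_confusion` and `estimate_metrics`. For ratio metrics such as precision and recall, plugging in expected counts is an approximation to the expected value of the metric itself.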
