Estimating Model Performance Under Covariate Shift Without Labels

Jakub Białek; Juhani Kivimäki; Wojtek Kuberski; Nikolaos Perrakis

TL;DR

PAPE tackles the problem of estimating binary classifier performance after deployment when labels are unavailable and the covariate distribution shifts. It combines density-ratio estimation, which aligns the source and target distributions, with a calibrated score transformation. This enables CBPE-style estimation of any metric derived from the confusion matrix and provides uncertainty bounds under approximate calibration. The method extends prior unsupervised estimators beyond accuracy, demonstrates strong empirical gains across over 900 dataset-model combinations and 36k data chunks, and discusses practical limitations and future extensions. The work offers a scalable, non-invasive monitoring approach that can help quantify business impact and guide retraining or downstream process adjustments without requiring labeled production data.
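To make "CBPE-style estimation" concrete, here is a minimal sketch of how confusion-matrix elements can be estimated from probabilities alone. It assumes the probabilities are already calibrated for the target data; the function names and the 0.5 threshold are illustrative choices, not taken from the paper. Computing F1 from expected counts is a simplification (a ratio of expectations rather than an expectation of the ratio).

```python
import numpy as np


def expected_confusion_matrix(calibrated_probs, threshold=0.5):
    """Expected confusion-matrix counts, assuming calibrated probabilities."""
    p = np.asarray(calibrated_probs, dtype=float)
    predicted_positive = p >= threshold
    return {
        # A positive prediction with probability p is a true positive with
        # probability p and a false positive with probability 1 - p.
        "tp": float(p[predicted_positive].sum()),
        "fp": float((1.0 - p[predicted_positive]).sum()),
        # A negative prediction is a false negative with probability p.
        "fn": float(p[~predicted_positive].sum()),
        "tn": float((1.0 - p[~predicted_positive]).sum()),
    }


def estimated_accuracy(cm):
    total = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    return (cm["tp"] + cm["tn"]) / total


def estimated_f1(cm):
    denominator = 2.0 * cm["tp"] + cm["fp"] + cm["fn"]
    return 2.0 * cm["tp"] / denominator if denominator > 0 else 0.0


probs = [0.9, 0.8, 0.3, 0.1]
cm = expected_confusion_matrix(probs)
print(estimated_accuracy(cm), estimated_f1(cm))
```

No labels appear anywhere in the computation, so the quality of the estimate depends entirely on how well the probabilities are calibrated on the data being scored.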

Abstract

After deployment, machine learning models often experience performance degradation due to shifts in data distribution. It is challenging to assess post-deployment performance accurately when labels are missing or delayed. Existing proxy methods, such as data drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce Probabilistic Adaptive Performance Estimation (PAPE), a new method for evaluating binary classification models on unlabeled tabular data that accurately estimates model performance under covariate shift. It can be applied to any performance metric defined with elements of the confusion matrix. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates. It does not need any assumptions about the nature of the covariate shift; instead, it learns the shift directly from data. We tested PAPE on over 900 dataset-model combinations from US census data, comparing it against several benchmarks across various metrics. Our findings show that PAPE outperforms the other methods for estimating the performance of binary classification models.
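The idea of adapting calibration to shifted data can be illustrated with importance-weighted histogram binning. This is a sketch under stated assumptions, not the paper's implementation: the weights are assumed to be density-ratio estimates already computed for each labeled reference row, histogram binning stands in for whatever calibrator PAPE actually uses, and all function names are hypothetical.

```python
import numpy as np


def fit_weighted_binning(scores, labels, weights, n_bins=10):
    """Fit a histogram-binning calibrator on reference data, with each row
    weighted by an (assumed, pre-computed) density-ratio estimate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each score to a bin index in 0..n_bins-1.
    bin_idx = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    weighted_pos = np.bincount(bin_idx, weights=weights * labels, minlength=n_bins)
    weighted_all = np.bincount(bin_idx, weights=weights, minlength=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    # Weighted fraction of positives per bin; empty bins fall back to the center.
    bin_values = np.where(
        weighted_all > 0, weighted_pos / np.maximum(weighted_all, 1e-12), centers
    )
    return edges, bin_values


def apply_binning(scores, edges, bin_values):
    """Map raw scores to the calibrated value of the bin they fall into."""
    scores = np.asarray(scores, dtype=float)
    idx = np.clip(np.digitize(scores, edges[1:-1]), 0, len(bin_values) - 1)
    return bin_values[idx]


scores = [0.2, 0.3, 0.7, 0.8]
labels = [0, 1, 1, 0]
weights = [1.0, 3.0, 1.0, 1.0]  # hypothetical density-ratio estimates
edges, values = fit_weighted_binning(scores, labels, weights, n_bins=2)
print(apply_binning([0.1, 0.9], edges, values))
```

Rows that resemble the production data receive larger weights, so the calibrated probabilities reflect the shifted population rather than the reference one. The calibrated outputs can then feed a label-free metric estimate.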

Paper Structure (42 sections, 6 theorems, 50 equations, 7 figures, 2 tables)

Introduction
Related Work
Methodology
Unsupervised Performance Estimation Under Covariate Shift
Approximate Confidence Calibration
Probabilistic Adaptive Performance Estimation (PAPE)
Experimental Evaluation
Datasets
Experimental Setup
Benchmarks
PAPE
TEST SET performance
Confidence-based Performance Estimation (CBPE)
Average Threshold Confidence (ATC)
Difference of Confidence (DoC)
...and 27 more sections

Key Result

Theorem 3.1

Assume that $p_s(y|\boldsymbol{x}) = p_t(y|\boldsymbol{x})$ and that $f$ is $\alpha$-approximately multicalibrated in $\mathcal{D}_s$ with respect to $\mathcal{H}$. If $w_{s \rightarrow t} \in \mathcal{H}$, then $K(f, \mathcal{D}_t) \le \alpha$.
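One way to read the condition $w_{s \rightarrow t} \in \mathcal{H}$ is through the standard importance-weighting identity. The sketch below is not the paper's proof. It assumes that $w_{s \rightarrow t}(\boldsymbol{x}) = p_t(\boldsymbol{x}) / p_s(\boldsymbol{x})$ is the density ratio, that $p_s(\boldsymbol{x}) > 0$ wherever $p_t(\boldsymbol{x}) > 0$, and that $g$ is an arbitrary integrable function (a placeholder, not notation from the paper).

$$
\begin{aligned}
\mathbb{E}_{\mathcal{D}_t}\left[ g(\boldsymbol{x}, y) \right]
&= \int p_t(\boldsymbol{x}) \sum_{y} p_t(y|\boldsymbol{x}) \, g(\boldsymbol{x}, y) \, d\boldsymbol{x} \\
&= \int p_s(\boldsymbol{x}) \, w_{s \rightarrow t}(\boldsymbol{x}) \sum_{y} p_s(y|\boldsymbol{x}) \, g(\boldsymbol{x}, y) \, d\boldsymbol{x} \\
&= \mathbb{E}_{\mathcal{D}_s}\left[ w_{s \rightarrow t}(\boldsymbol{x}) \, g(\boldsymbol{x}, y) \right]
\end{aligned}
$$

The second line uses the covariate shift assumption $p_s(y|\boldsymbol{x}) = p_t(y|\boldsymbol{x})$. Taking $g(\boldsymbol{x}, y) = y - f(\boldsymbol{x})$, a calibration residual on the target distribution equals a $w_{s \rightarrow t}$-weighted residual on the source distribution, which is the kind of quantity a multicalibration guarantee over a class containing $w_{s \rightarrow t}$ is designed to control.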

Figures (7)

Figure 1: Estimation of AUROC for ACSIncome data (California) and LGBM as the monitored model. The black line is the realized AUROC of the monitored model for each data chunk. The red line is the AUROC estimated with PAPE. The brown dashed line is the TEST SET performance.
Figure 2: Estimation error (NMAE) of the estimated metric vs. realized absolute performance change, measured in SE, for all estimators. The x-axis indicates the center of the data bucket; for example, value 1 indicates a bucket that contains data chunks for which the absolute performance change was between 0 and 2 SE. The left y-axis shows the NMAE of the evaluated method for the data bucket. The right y-axis shows the number of data chunks in each bucket on a logarithmic scale, as depicted by the grey dashed line.
Figure 3: Effect of sample size on mean absolute error of AUROC estimation. Calculated for sample sizes of 100, 200, 500, 1000, 2000, and 5000, on data from California, for the prediction task: employment.
Figure 4: The distribution of distances from the origin in the training set for the synthetic data.
Figure 5: Estimator performance of each method for all three metrics with gradually increasing covariate shift. Shift magnitude corresponds to threshold $t$.
...and 2 more figures

Theorems & Definitions (15)

Definition 3.1
Definition 3.2
Theorem 3.1
Theorem 3.2
Lemma B.1
proof
Lemma B.2
proof
proof
proof
...and 5 more
