Distance Matters For Improving Performance Estimation Under Covariate Shift

Mélanie Roschewitz; Ben Glocker

TL;DR

Performance estimation under covariate shift degrades when test samples depart from the training distribution, because confidence scores become unreliable. The authors introduce a "distance-check" in embedding space that flags samples lying too far from the in-distribution (ID) training data, and plug it into two state-of-the-art estimators to form ATC-DistCS and GDE-DistCS. Across 13 diverse image-classification tasks and hundreds of models, the distance-aware estimators achieve a median relative MAE reduction of around $27$–$30\%$ and state-of-the-art performance on most tasks, without requiring any out-of-distribution (OOD) data. The approach bridges OOD detection and performance estimation, supporting safer deployment and real-time monitoring, and the code is publicly available. The work highlights that incorporating distance-to-training-distribution information is crucial for reliable performance estimation under distribution shift.

Abstract

Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account the distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Specifically, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
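
The idea in the abstract can be illustrated with a minimal sketch of a confidence-threshold estimator combined with a distance check. This is not the authors' implementation (see their repository for that); it assumes a K-NN distance in embedding space, assumes that flagged samples are simply counted as likely errors, and uses hypothetical function names (`knn_distance`, `fit_atc_threshold`, `estimate_accuracy`). How the distance threshold is chosen (for example, a high percentile of the ID validation distances) is also an assumption left to the caller.

```python
import numpy as np


def knn_distance(queries, reference, k=5):
    """Mean Euclidean distance from each query to its k nearest reference points.

    Assumes k <= number of reference points. Brute force, for illustration only.
    """
    # Pairwise distances, shape (n_queries, n_reference).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep the k smallest distances per query (order within the k is irrelevant).
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


def fit_atc_threshold(val_conf, val_correct):
    """Confidence threshold such that the fraction of ID validation samples
    above it roughly matches the ID validation accuracy."""
    val_acc = np.asarray(val_correct, dtype=float).mean()
    # A fraction (1 - accuracy) of validation confidences should fall below it.
    return np.quantile(val_conf, 1.0 - val_acc)


def estimate_accuracy(test_conf, test_dist, conf_threshold, dist_threshold):
    """Estimated accuracy: a sample counts as 'predicted correct' only if it is
    confident enough AND close enough to the training embeddings."""
    trusted = test_dist <= dist_threshold          # passes the distance check
    confident = test_conf >= conf_threshold        # passes the confidence check
    return float((confident & trusted).mean())
```

Passing `dist_threshold=np.inf` disables the distance check and recovers the plain confidence-threshold estimate, which makes it easy to compare the two variants on the same test set.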

Paper Structure (35 sections, 2 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Methodological contributions
Main results
Background
Performance estimation without ground truth
Estimating performance via auxiliary task performance
Training a regressor between ID and OOD accuracy
Agreement-based estimators
Confidence-based estimators
Distance-based out-of-distribution detection
Methods
Base estimators
Average Thresholded Confidence (ATC) (Garg et al., 2022)
Generalised Disagreement Equality (GDE) (Jiang et al., 2022)
Integrating distance to training set
...and 20 more sections

Figures (8)

Figure 1: Performance estimation under covariate shift needs to take into account different sources of errors. Distance to the source distribution in the embedding space matters as confidence estimates become unreliable with increased distance.
Figure 2: Why distance matters, an example. Joint t-SNE representation of the ID validation set and OOD test set, plotted separately, for a ResNet18 model on the WILDS CameLyon dataset. We can clearly distinguish a region with low density on the validation set and high density on the OOD set, where most points are misclassified.
Figure 3: Ablation study: MSE as a function of corruption strength for CIFAR10-C across all models. The shaded area depicts ± one standard deviation.
Figure 4: Ablation study for the choice of distance estimation method: K-NN (DistCS) versus Mahalanobis distance (Maha). Each boxplot shows the distribution of the Mean Absolute Error for accuracy estimation. Whiskers denote the [5%; 95%] percentiles of the distribution; outliers are omitted for readability. Using distance improves the results for all but one dataset, regardless of whether K-NN or Mahalanobis distance is used. However, K-NN distance is better than Mahalanobis overall. For additional datasets, see Supp. Note 4.
Figure 5: Predicted versus true accuracy for all models and datasets.
...and 3 more figures
