Distance Matters For Improving Performance Estimation Under Covariate Shift

Mélanie Roschewitz, Ben Glocker

TL;DR

Performance estimation under covariate shift degrades when test samples depart from the training distribution, because confidence scores become unreliable there. The authors introduce a distance check in embedding space that flags samples lying too far from the in-distribution (ID) data, and plug it into state-of-the-art estimators to form ATC-DistCS and GDE-DistCS. Across 13 image-classification tasks and hundreds of models, the distance-aware estimators achieve a median relative MAE reduction of around $27$–$30\%$ and SOTA performance on 10 of 13 tasks, while requiring no OOD data. The approach bridges OOD detection and performance estimation, supporting safer deployment and real-time monitoring, and the code is publicly available. The work highlights that incorporating distance-to-distribution information is crucial for reliable performance estimation under distribution shift.

Abstract

Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Precisely, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
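
To make the "distance-check" idea concrete, here is a minimal NumPy sketch of a confidence-threshold accuracy estimator combined with a K-NN distance filter in embedding space. It is an illustration under stated assumptions, not the authors' implementation (see their repository for that). The function names, the choice of `k`, the use of the mean K-NN distance, and the 99th-percentile distance cutoff are all hypothetical choices made for this example.

```python
import numpy as np


def knn_distance(queries, reference, k=5):
    """Mean Euclidean distance from each query to its k nearest reference points.

    Assumption: brute-force pairwise distances, suitable for small arrays only.
    """
    diffs = queries[:, None, :] - reference[None, :, :]  # (n_query, n_ref, dim)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))           # (n_query, n_ref)
    dists.sort(axis=1)                                   # ascending per query
    return dists[:, :k].mean(axis=1)


def atc_threshold(val_conf, val_correct):
    """Confidence threshold t such that the fraction of ID validation samples
    with confidence below t roughly matches the validation error rate."""
    error_rate = 1.0 - np.mean(val_correct)
    return np.quantile(val_conf, error_rate)


def distance_aware_accuracy(test_conf, test_dist, conf_threshold, dist_threshold):
    """Estimated accuracy: a test sample counts as correct only if it is both
    confident enough and close enough to the reference embeddings."""
    trusted = (test_conf >= conf_threshold) & (test_dist <= dist_threshold)
    return trusted.mean()


# Synthetic demonstration (random data, for illustration only).
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(200, 8))
val_emb = rng.normal(size=(100, 8))
test_emb = rng.normal(loc=3.0, size=(100, 8))   # shifted test embeddings

val_conf = rng.uniform(0.5, 1.0, size=100)
val_correct = rng.random(100) < 0.9
test_conf = rng.uniform(0.5, 1.0, size=100)

t_conf = atc_threshold(val_conf, val_correct)
# Assumed cutoff: 99th percentile of validation distances to the training set.
t_dist = np.percentile(knn_distance(val_emb, train_emb), 99)

plain = distance_aware_accuracy(test_conf, np.zeros(100), t_conf, t_dist)
aware = distance_aware_accuracy(
    test_conf, knn_distance(test_emb, train_emb), t_conf, t_dist
)
print(f"confidence only: {plain:.2f}, with distance check: {aware:.2f}")
```

Because the distance filter can only remove samples from the "trusted" set, the distance-aware estimate is never higher than the confidence-only estimate in this sketch. That matches the motivation in the abstract: confident predictions far from the training distribution should not be counted as correct.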

Paper Structure (35 sections, 2 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: Performance estimation under covariate shift needs to take into account different sources of errors. Distance to the source distribution in the embedding space matters as confidence estimates become unreliable with increased distance.
  • Figure 2: Why distance matters, an example. Joint t-SNE representation of the ID validation set and OOD test set, plotted separately, for a ResNet18 model on the WILDS CameLyon dataset. We can clearly distinguish a region with low density on the validation set and high density on the OOD set, where most points are misclassified.
  • Figure 3: Ablation study: MSE as a function of corruption strength for CIFAR10-C across all models. The shaded area depicts +/- one standard deviation.
  • Figure 4: Ablation study for the choice of distance estimation method: K-NN (DistCS) versus Mahalanobis distance (Maha). Each boxplot shows the distribution of the Mean Absolute Error for accuracy estimation. Whiskers denote the 5th and 95th percentiles of the distribution; outliers are omitted for readability. Using distance improves the results for all but one dataset, regardless of whether K-NN or Mahalanobis distance is used. However, K-NN distance is better than Mahalanobis overall. For additional datasets, see Supp. Note 4.
  • Figure 5: Predicted versus true accuracy for all models and datasets.
  • ...and 3 more figures