Distance Matters For Improving Performance Estimation Under Covariate Shift
Mélanie Roschewitz, Ben Glocker
TL;DR
Performance estimation under covariate shift degrades when test samples depart from the training distribution, because confidence scores become unreliable. The authors introduce a distance check in embedding space that flags samples lying too far from the in-distribution (ID) training data, and plug it into state-of-the-art estimators to form ATC-DistCS and GDE-DistCS. Across 13 diverse image-classification tasks and hundreds of models, the distance-aware estimators achieve a median relative MAE reduction of around 27–30% and SOTA performance on 10 out of 13 tasks, while requiring no out-of-distribution (OOD) data. The approach bridges OOD detection and performance estimation, enabling safer deployment and real-time monitoring, and the code is publicly available. The work highlights that incorporating distance-to-training-distribution information is crucial for reliable performance estimation under distribution shift.
Abstract
Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Specifically, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
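
To make the idea concrete, the following is a minimal, dependency-free sketch of how a distance check might be combined with a confidence-threshold estimator such as ATC. It is an illustration under stated assumptions, not the released implementation: the use of a k-th nearest-neighbour Euclidean distance, the `keep_fraction` rule for setting the distance threshold, the choice to count flagged samples as incorrect, and all function names are hypothetical choices made for this example. The confidence threshold follows the usual ATC recipe of matching the fraction of validation samples below the threshold to the validation error rate.

```python
import math


def kth_neighbour_distance(query, reference, k=1):
    """Euclidean distance from `query` to its k-th nearest embedding in `reference`."""
    distances = sorted(math.dist(query, ref) for ref in reference)
    return distances[k - 1]


def fit_distance_threshold(val_distances, keep_fraction=0.99):
    """Distance below which roughly `keep_fraction` of ID validation samples fall.

    Assumption for this sketch: the threshold is a nearest-rank quantile of the
    validation distances, so no OOD data is needed to set it.
    """
    ordered = sorted(val_distances)
    index = math.ceil(keep_fraction * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, index))]


def fit_atc_threshold(val_confidences, val_correct):
    """ATC-style threshold: the number of validation confidences strictly below
    it equals the number of validation errors (ignoring ties)."""
    n_errors = sum(1 for is_correct in val_correct if not is_correct)
    ordered = sorted(val_confidences)
    return ordered[n_errors] if n_errors < len(ordered) else math.inf


def estimate_accuracy(test_confidences, test_distances, conf_threshold, dist_threshold):
    """Estimated accuracy on unlabelled test data.

    A sample is predicted correct only if it is confident enough AND passes the
    distance check. Counting flagged samples as incorrect is an assumption here.
    """
    predicted_correct = [
        conf >= conf_threshold and dist <= dist_threshold
        for conf, dist in zip(test_confidences, test_distances)
    ]
    return sum(predicted_correct) / len(predicted_correct)
```

In this sketch, both thresholds are fitted on labelled ID validation data only. At test time the estimator needs just the softmax confidence of each sample and the distance of its embedding to the training embeddings, so a confident prediction on a sample far from the training data no longer inflates the accuracy estimate.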
