ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction

Han Yu; Kehan Li; Dongbai Li; Yue He; Xingxuan Zhang; Peng Cui

TL;DR

This work tackles the challenge of predicting a trained model's performance on unlabeled out-of-distribution (OOD) data. It introduces ODP-Bench, a large-scale benchmark comprising 29 OOD datasets, 1,444 off-the-shelf models, 10 prediction algorithms, and a codebase for extensibility, enabling fair and reproducible comparisons. Experimental results show strong predictive signals on synthetic corruptions but limited performance on complex real-world shifts, especially subpopulation shifts, underscoring gaps in current approaches. The benchmark provides a practical path toward reliable model deployment in risk-sensitive contexts by standardizing evaluation and facilitating systematic analysis of OOD performance prediction methods.

Abstract

Out-of-Distribution (OOD) performance prediction has recently attracted growing attention. Its goal is to predict the performance of trained models on unlabeled OOD test datasets, so that off-the-shelf trained models can be better leveraged and deployed in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons across algorithms, we propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, which keeps comparisons consistent and avoids the burden of repeating the model training process. Furthermore, we conduct in-depth experimental analyses to better understand the capability boundaries of these algorithms.
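To make the task concrete, the following is a minimal sketch of one simple label-free estimator, average softmax confidence, together with the mean absolute error one might use to score a predictor across several test sets. It is an illustration under stated assumptions, not the benchmark's codebase: the function names, the toy logits, and the choice of mean absolute error as the metric are all hypothetical, and the abstract does not say which algorithms or metrics ODP-Bench uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_confidence(logit_rows):
    """Estimate accuracy on unlabeled data as the mean max-softmax probability.

    Assumption: the model's confidence is roughly calibrated, which often
    fails under distribution shift. No labels are used.
    """
    if not logit_rows:
        raise ValueError("need at least one prediction")
    confidences = [max(softmax(row)) for row in logit_rows]
    return sum(confidences) / len(confidences)


def mean_absolute_error(predicted, actual):
    """Average gap between predicted and true accuracy over several test sets."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical logits of a 3-class model on three unlabeled OOD samples.
ood_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
estimate = average_confidence(ood_logits)
print(f"estimated OOD accuracy: {estimate:.3f}")
```

In an evaluation of this kind, the estimate for each model and test set would be compared against the accuracy measured with held-back labels, and a predictor with a lower error across many models and shifts would be preferred.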
