A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models
Shiho Noda, Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
TL;DR
Conventional OOD detection benchmarks have saturated. This paper introduces three more realistic benchmarks for vision-language (CLIP-based) OOD detection: ImageNet-X for challenging semantic shifts, ImageNet-FS-X for covariate shifts, and Wilds-FS-X for real-world domain shifts. A systematic evaluation of zero-shot and few-shot CLIP-based detectors shows that covariate shifts substantially degrade performance, that method rankings vary across benchmarks, and that few-shot learning can bias detectors towards the covariates seen in training. The findings point to the need for OOD detection methods that remain robust under both semantic and covariate shifts, including those found in real-world data. The benchmarks and code are released to support reproducibility, and the authors advocate evaluating beyond traditional datasets to improve the real-world applicability of OOD detection in vision-language systems.
Abstract
Out-of-distribution (OOD) detection is the task of identifying OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community will go beyond specific benchmarks and include more challenging conditions that reflect real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Benchmarks.
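To make the difference between a standard and a full-spectrum evaluation concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes the usual reading of full-spectrum OOD detection, in which covariate-shifted samples of known classes still count as in-distribution (ID), and it assumes a detector whose score is higher for ID samples. The function name `auroc` and all score values are hypothetical, chosen only for illustration; the metrics actually reported in the paper are not stated in this section.

```python
from typing import Sequence


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Probability that a random ID sample scores above a random OOD sample.

    Ties count as half a win. Assumes higher score means "more ID".
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical detector scores (illustrative only, not results from the paper).
clean_id = [0.9, 0.8, 0.7]      # ID classes, training-like covariates
shifted_id = [0.6, 0.4, 0.3]    # same ID classes under covariate shift
semantic_ood = [0.5, 0.2, 0.1]  # classes outside the ID label set

# Standard setting: only clean ID samples are compared against semantic OOD.
standard = auroc(clean_id, semantic_ood)

# Full-spectrum setting: covariate-shifted ID samples are also treated as ID.
full_spectrum = auroc(clean_id + shifted_id, semantic_ood)

print(f"standard AUROC:      {standard:.3f}")
print(f"full-spectrum AUROC: {full_spectrum:.3f}")
```

In this toy case the detector separates clean ID from semantic OOD perfectly, yet its AUROC drops once covariate-shifted ID samples are added, because some of them score below an OOD sample. This is the kind of degradation a full-spectrum benchmark is meant to expose.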
