A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models

Shiho Noda, Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa

TL;DR

This paper addresses the saturation of traditional OOD benchmarks by introducing three realistic benchmarks (ImageNet-X, ImageNet-FS-X, and Wilds-FS-X) that isolate semantic shifts, covariate shifts, and real-world domain shifts in vision-language (CLIP-based) OOD detection. It systematically evaluates zero-shot and few-shot CLIP-based detectors, revealing that covariate shifts substantially degrade performance and that method rankings vary across benchmarks, with few-shot learning sometimes biasing detectors toward the covariates seen in training. The findings underscore the need for OOD detection methods that remain robust under both semantic and covariate shifts, including those found in real-world data, and point to the value of these benchmarks for guiding future research. The work supports reproducibility by releasing benchmarks and code, and it advocates evaluation beyond traditional datasets to improve the real-world applicability of OOD detection in vision-language systems.

Abstract

Out-of-distribution (OOD) detection is the task of detecting OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Benchmarks.

Paper Structure

This paper contains 21 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Data examples of ID and OOD in ImageNet-X, a benchmark for challenging semantic shifts.
  • Figure 2: Data structure and examples in the proposed benchmarks incorporating covariate shifts (feature distribution shifts).
  • Figure 6: Difference in OOD detection performance (AUROC) between ImageNet-X and ImageNet-FS-X for CLIP-based methods, calculated as ImageNet-X minus ImageNet-FS-X results.
  • Figure A: OOD score distribution and sample scores in ImageNet-X.
  • Figure B: OOD score distribution and sample scores in ImageNet-FS-X. The samples belong to the "piano" label and are part of the ID data.
  • ...and 2 more figures
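
The AUROC difference described for Figure 6 (ImageNet-X result minus ImageNet-FS-X result) can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes the usual full-spectrum convention that covariate-shifted ID samples still count as ID, and that a higher score means "more ID-like". The function name `auroc` and all score values are hypothetical and chosen only for illustration.

```python
def auroc(id_scores, ood_scores):
    """Pairwise AUROC: the fraction of (ID, OOD) pairs in which the ID sample
    receives the higher score. Ties count as half a win."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Hypothetical detector scores (higher = more ID-like); not taken from the paper.
id_clean = [0.9, 0.8, 0.7]            # ID samples without covariate shift
id_covariate_shifted = [0.6, 0.4]     # same ID classes, shifted appearance (assumed to be ID)
ood_semantic = [0.5, 0.3, 0.2]        # semantically different classes

# Semantic-shift-only setting: clean ID vs. OOD.
auroc_x = auroc(id_clean, ood_semantic)

# Full-spectrum setting: clean ID plus covariate-shifted ID vs. the same OOD.
auroc_fs_x = auroc(id_clean + id_covariate_shifted, ood_semantic)

# The quantity plotted in Figure 6 is a difference of this form.
gap = auroc_x - auroc_fs_x
print(f"AUROC (semantic only): {auroc_x:.3f}")
print(f"AUROC (full-spectrum): {auroc_fs_x:.3f}")
print(f"Difference:            {gap:.3f}")
```

In this toy setup the gap is positive because one covariate-shifted ID sample scores below an OOD sample, which is the kind of degradation the full-spectrum benchmarks are meant to expose.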