OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection

Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, Yiran Chen, Hai Li

TL;DR

OpenOOD v1.5 addresses the need for scalable, standardized evaluation in Out-of-Distribution (OOD) detection by extending to large-scale datasets (ImageNet-1K/200) and foundation models (e.g., CLIP, DINOv2) and by introducing full-spectrum OOD detection, which considers semantic and covariate shifts together. It formalizes OOD via density-based definitions and open space risk, and it defines a rigorous evaluation protocol with near- and far-OOD splits, dedicated validation sets, and disjoint OOD training/test data. The paper provides four standard and two full-spectrum benchmarks, analyzes 40 methods across diverse architectures, and delivers actionable insights such as the broad benefit of data augmentations and the nuanced effects of model architecture and of training-time versus post-hoc approaches. Its findings highlight that no single detector dominates across all settings, that full-spectrum detection remains a challenging open problem, and that foundation models show promise but require detector alignment; collectively, OpenOOD v1.5 supplies a robust, scalable benchmark to accelerate progress in OOD detection. The open space risk that OOD detectors aim to minimize is $R_O(f)=\frac{\iint f(x)p_{\mathcal{D}_{OOD}}(x,y)\,dx\,dy}{\iint f(x)p_{\mathcal{D}_{OOD}}(x,y)\,dx\,dy+\iint f(x)p_{\mathcal{D}_{ID}}(x,y)\,dx\,dy}$.
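
The open space risk $R_O(f)$ in the TL;DR can be approximated from finite samples by replacing each integral with a sample mean. The sketch below is only an illustration under stated assumptions: it takes $f(x)\in\{0,1\}$ to be an "accept as in-distribution" decision (the summary does not define $f$), and the function name `open_space_risk`, the threshold of 0.5, and the toy scores are hypothetical, not taken from the paper or the OpenOOD codebase.

```python
from typing import Callable, Sequence


def open_space_risk(
    f: Callable[[float], float],
    ood_samples: Sequence[float],
    id_samples: Sequence[float],
) -> float:
    """Monte Carlo estimate of R_O(f) = E_OOD[f] / (E_OOD[f] + E_ID[f]).

    Assumption: f returns 1.0 when a sample is accepted as ID, else 0.0.
    """
    # Each integral of f against a density is an expectation, so it is
    # approximated here by the mean of f over samples from that distribution.
    ood_mass = sum(f(x) for x in ood_samples) / len(ood_samples)
    id_mass = sum(f(x) for x in id_samples) / len(id_samples)
    total = ood_mass + id_mass
    if total == 0:
        raise ValueError("f accepts no samples; the ratio is undefined")
    return ood_mass / total


# Hypothetical detector: accept a sample when its confidence score is >= 0.5.
accept = lambda score: 1.0 if score >= 0.5 else 0.0

# Toy confidence scores (made up for illustration only).
id_scores = [0.9, 0.8, 0.7, 0.4]
ood_scores = [0.6, 0.2, 0.1, 0.3]

risk = open_space_risk(accept, ood_scores, id_scores)
print(risk)
```

With these toy scores the detector accepts one of four OOD samples and three of four ID samples, so the estimate is 0.25 / (0.25 + 0.75) = 0.25. A detector that accepts no OOD samples would drive the estimate to zero.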

Abstract

Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, the evaluation inconsistencies present challenges for tracking the progress in this field. OpenOOD v1 initiated the unification of the OOD detection evaluation but faced limitations in scalability and scope. In response, this paper presents OpenOOD v1.5, a significant improvement from its predecessor that ensures accurate and standardized evaluation of OOD detection methodologies at large scale. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale data sets (ImageNet) and foundation models (e.g., CLIP and DINOv2), and expands its scope to investigate full-spectrum OOD detection which considers semantic and covariate distribution shifts at the same time. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.

Paper Structure (15 sections, 2 equations, 8 figures, 6 tables)

This paper contains 15 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Illustration of full-spectrum OOD detection (Yang et al., 2022) using our ImageNet benchmark. Standard detection only concerns semantic shift by detecting (c) + (d) from (a), while full-spectrum detection takes into account covariate shift and aims to separate (c) + (d) from (a) + (b). An ideal system should be robust to the non-semantic covariate shift (OOD generalization) while being able to identify semantic shift (OOD detection).
  • Figure 2: Near-OOD improvements are proportional to, yet slower than, far-OOD improvements on ImageNet-1K.
  • Figure 3: OOD detection rates of post-hoc methods with different architectures on ImageNet-1K. Some methods are sensitive to model architecture while others are not. Transformers do not seem to have a clear advantage over ResNets.
  • Figure 4: Inference time of each method (in milliseconds and sorted from left to right) on a batch of 200 ImageNet 224x224 images. The base model is ResNet-50. The inference time is profiled with a single 24GB GPU, and we report the average results over 5 runs. We notice that some methods incur significantly larger inference cost than others.
  • Figure 5: Comparison between standard and full-spectrum detection on ImageNet-1K (near-OOD). Many detectors suffer significant performance degradation in the full-spectrum setting.
  • ...and 3 more figures
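
The difference between the standard and full-spectrum settings described in the Figure 1 caption comes down to which samples count as in-distribution when a detector is scored. The following is a minimal sketch, assuming a detector that outputs a scalar score where higher means "more in-distribution" and using AUROC as the metric; the helper `auroc`, the group names, and all scores are hypothetical toy values, not results from the paper.

```python
from typing import Sequence


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """AUROC as the probability that a random ID sample outscores a random
    OOD sample, with ties counted as one half (pairwise formulation)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy detector scores for the four groups in Figure 1 (made up).
scores = {
    "a_id": [0.9, 0.8],            # (a) training-distribution ID
    "b_covariate_id": [0.5, 0.3],  # (b) covariate-shifted ID
    "c_ood": [0.4],                # (c) semantically shifted OOD
    "d_ood": [0.1],                # (d) semantically shifted OOD
}

ood = scores["c_ood"] + scores["d_ood"]

# Standard detection: separate (c) + (d) from (a) only.
standard = auroc(scores["a_id"], ood)

# Full-spectrum detection: separate (c) + (d) from (a) + (b).
full_spectrum = auroc(scores["a_id"] + scores["b_covariate_id"], ood)

print(standard, full_spectrum)
```

In this toy case the detector is perfect in the standard setting but drops in the full-spectrum setting, because one covariate-shifted ID sample scores below an OOD sample. This mirrors, in miniature, the kind of degradation the Figure 5 caption reports.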

Theorems & Definitions (2)

  • Definition 1
  • Definition 2