Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Hongjun Wang, Sagar Vaze, Kai Han

TL;DR

The paper analyzes OOD detection and OSR as two related distribution-shift tasks, revealing strong cross-field performance correlations and introducing a large-scale benchmark that separately evaluates semantic shift (OSR) and covariate shift (OOD). It demonstrates that magnitude-aware scoring rules (notably MLS and Energy) are robust across settings, while Outlier Exposure (OE) saturates on standard benchmarks but struggles to scale when distribution gaps widen. The authors show that the efficacy of OE depends critically on the similarity between auxiliary OOD data and the test OOD distribution, and they introduce the Outlier-Aware Accuracy metric to capture robustness–detection trade-offs. The proposed benchmark and findings offer practical guidance for evaluating and developing methods under real-world distribution shifts and lay a foundation for future unified studies of OOD and OSR.

Abstract

Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR
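As an illustration of what "sensitive to magnitude" can mean for a scoring rule, the following minimal sketch compares three common scores computed from a classifier's logits: maximum softmax probability (MSP), maximum logit score (MLS), and the energy score. The function names, the temperature parameter, and the toy logits are assumptions made for this example and are not taken from the authors' code. The sketch only shows that softmax normalization discards a constant offset in the logits, whereas MLS and the energy score retain it.

```python
import math
from typing import Sequence


def msp_score(logits: Sequence[float]) -> float:
    """Maximum softmax probability; invariant to adding a constant to all logits."""
    m = max(logits)
    # Subtract the max for numerical stability; the largest term becomes exp(0) = 1.
    return 1.0 / sum(math.exp(z - m) for z in logits)


def mls_score(logits: Sequence[float]) -> float:
    """Maximum logit score; keeps the raw magnitude of the largest logit."""
    return max(logits)


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Negative free energy, T * logsumexp(logits / T); higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


# Hypothetical logits: the second set is the first shifted up by 3.
logits_low = [2.0, 1.0, 0.0]
logits_high = [z + 3.0 for z in logits_low]

for name, fn in [("MSP", msp_score), ("MLS", mls_score), ("Energy", energy_score)]:
    print(name, round(fn(logits_low), 4), round(fn(logits_high), 4))
```

In this toy case MSP assigns both inputs the same score, while MLS and the energy score both increase by the size of the shift. Whether that extra information helps in practice is the empirical question the paper studies.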

Paper Structure

This paper contains 14 sections, 4 equations, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: Semantic shift vs. covariate shift. We systematically perform cross-evaluation between SOTA methods for OSR and OOD detection and propose a large-scale benchmark setting in which we disentangle the tasks tackled in the two fields, proposing that they tackle semantic shift (x-axis) and covariate shift (y-axis) respectively.
  • Figure 2: Visualization of feature projections for images from ID and open-set / OOD datasets. We project the features into a two-dimensional space using an additional linear layer with an output dimension of two after the penultimate layer. We conduct OOD detection and OSR experiments using ResNet-18 on CIFAR-10 (first row) and ResNet-50 on CUB (second row) datasets. For CIFAR-10, the OOD experiment uses the full CIFAR-10 dataset as ID data and Textures as OOD data, while the OSR experiment utilizes the first six classes in CIFAR-10 as ID data and the remaining four as open-set data. For CUB, the OOD experiment employs the full CUB dataset as ID and Waterbird as OOD data, while the OSR experiment uses six classes in CUB as ID data and four CUB classes as open-set data. These classes are randomly selected from the ID and open-set splits introduced in SSB. Notably, these visualizations reveal that the feature magnitudes of ID data exceed those of OOD or OSR data.
  • Figure 3: OSR performance vs. OOD detection performance of different training methods averaged across various scoring rules. CLIP variants are included here for reference and are not used to fit the correlation.
  • Figure 4: OOD detection performance of various scoring rules averaged across different models. Magnitude-aware scoring rules, particularly MLS, are the most efficient and stable techniques.
  • Figure 5: Histogram of activations for ResNet-18 pretrained on a subset of CIFAR-10 with four training classes and evaluated on training and ID testing data, open-set data (the disjoint six classes in CIFAR-10), and OOD data (from Textures, LSUN and Places365). Each subplot shows the maximum activation (along the channel, width and height dimensions) at the outputs of layer_1 to layer_4 of ResNet-18, displayed from left to right. The behavior of OE differs from that of CE, whose activation maps become more separable in the deeper rather than the shallower layers. See Appendix B for results on more datasets.
  • ...and 10 more figures
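The per-image statistic described in the Figure 5 caption (the maximum activation taken over the channel, height and width dimensions of a layer's output) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are represented as nested Python lists, the helper names `max_activation` and `histogram` are hypothetical, and the toy values stand in for real ResNet-18 activations.

```python
from typing import List, Sequence

# One feature map for one image, indexed as [channel][height][width].
FeatureMap = List[List[List[float]]]


def max_activation(fmap: FeatureMap) -> float:
    """Largest activation over all channels and spatial positions."""
    return max(v for channel in fmap for row in channel for v in row)


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count values in equal-width bins on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))
        counts[idx] += 1
    return counts


# Toy feature maps (assumed values): two ID images and one OOD image,
# each with a single 2x2 channel.
id_maps = [
    [[[0.1, 2.0], [0.3, 0.4]]],
    [[[1.5, 0.2], [0.0, 0.9]]],
]
ood_maps = [
    [[[0.2, 0.5], [0.1, 0.3]]],
]

id_max = [max_activation(f) for f in id_maps]
ood_max = [max_activation(f) for f in ood_maps]

print(histogram(id_max, 0.0, 2.0, 4))
print(histogram(ood_max, 0.0, 2.0, 4))
```

Repeating this for the outputs of each layer and overlaying the ID, open-set and OOD histograms would give a plot in the spirit of Figure 5.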