ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms

William Yang, Byron Zhang, Olga Russakovsky

TL;DR

This work introduces ImageNet-OOD, a dataset that isolates semantic out-of-distribution (OOD) detection from covariate shift. It reveals that modern detectors are disproportionately influenced by covariate shift and often offer minimal gains over the maximum softmax probability (MSP) baseline once covariate interference is controlled. By carefully constructing ImageNet-OOD from ImageNet-21K and removing contamination from ImageNet-1K, the authors show that many previously reported improvements on semantic shift benchmarks do not translate to real-world OOD detection under covariate shift. Across nine detectors and thirteen architectures, detector performance is more sensitive to covariate shift (e.g., ImageNet-R) than to pure semantic shift (ImageNet-OOD), and a sanity check with models that have random parameters confirms the covariate-driven bias. The findings challenge the practical utility of current OOD detectors for semantic shift and stress the need for methods that robustly separate semantic shift from covariate-driven cues, with implications for safer, more reliable deployment of vision systems.

Abstract

The task of out-of-distribution (OOD) detection is notoriously ill-defined. Earlier works focused on new-class detection, aiming to identify label-altering data distribution shifts, also known as "semantic shift." However, recent works argue for a focus on failure detection, expanding the OOD evaluation framework to account for label-preserving data distribution shifts, also known as "covariate shift." Intriguingly, under this new framework, complex OOD detectors that were previously considered state-of-the-art now perform similarly to, or even worse than, the simple maximum softmax probability baseline. This raises the question: what are the latest OOD detectors actually detecting? Deciphering the behavior of OOD detection algorithms requires evaluation datasets that decouple semantic shift and covariate shift. To aid our investigations, we present ImageNet-OOD, a clean semantic shift dataset that minimizes the interference of covariate shift. Through comprehensive experiments, we show that OOD detectors are more sensitive to covariate shift than to semantic shift, and that the benefits of recent OOD detection algorithms on semantic shift detection are minimal. Our dataset and analyses provide important insights for guiding the design of future OOD detectors.
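To make the baseline in the abstract concrete, here is a minimal sketch of three logit-based OOD scores (MSP, Max-Logit, and an energy-style score) together with a pairwise AUROC. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the toy logits, and the convention that a higher score means "more in-distribution" are all choices made for this example.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability; higher is treated as more in-distribution."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def max_logit_score(logits):
    """Largest raw logit, with no softmax normalization."""
    return logits.max(axis=1)

def energy_score(logits, temperature=1.0):
    """Energy-style score T * logsumexp(logits / T); the temperature is an assumed knob."""
    z = logits / temperature
    m = z.max(axis=1)
    return temperature * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))

def auroc(id_scores, ood_scores):
    """Chance that a random ID sample outscores a random OOD sample (ties count half)."""
    a = np.asarray(id_scores)[:, None]
    b = np.asarray(ood_scores)[None, :]
    return (a > b).mean() + 0.5 * (a == b).mean()

# Hypothetical toy logits: confident on ID inputs, nearly flat on OOD inputs.
id_logits = np.array([[8.0, 1.0, 0.5], [6.0, 0.2, 0.1]])
ood_logits = np.array([[1.0, 0.9, 1.1], [0.3, 0.2, 0.4]])

for name, fn in [("MSP", msp_score), ("Max-Logit", max_logit_score), ("Energy", energy_score)]:
    print(name, auroc(fn(id_logits), fn(ood_logits)))
```

On real data the logits would come from a trained classifier evaluated on an in-distribution set and a shifted set; comparing the same score across a semantic-shift set and a covariate-shift set is the kind of contrast the abstract describes.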

Paper Structure (38 sections, 2 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Removing ambiguities in ImageNet-OOD. We identify classes in ImageNet-21K that should not be included in the ImageNet-OOD dataset, since it would be ambiguous whether they are truly OOD with respect to the ImageNet-1K classes. Left: Semantic Ambiguity. "Frozen Dessert" in ImageNet-21K [Imageneto13] should not be considered OOD as it is a hypernym of "Ice Cream." Additionally, classes associated with organisms are problematic in the WordNet hierarchy: "Herbivore" contains images from the ImageNet-1K class "Cattle" but is neither a hypernym nor a hyponym of it. Middle: Semantically-grounded Covariate Shifts. A dog vs. vehicle classifier can also be thought of as an animal vs. vehicle classifier. Given this classifier, it is unclear whether "cat" should be considered OOD. Right: Visual Ambiguity. "Violin" and "Viola" or "Scuba Diver" and "Aqualung" are visually indistinguishable to human labelers, leading to potential annotation error.
  • Figure 2: Examples of Images from ImageNet-OOD. Images around the 10th, 30th, 50th, 70th, and 90th percentiles, ranked either by the distance to the closest ImageNet-1K image using features from a self-supervised ResNet-50 pre-trained on the PASS dataset [pass60] or by scores from the OOD detectors MSP [MSP01], Energy [Energy02], ViM [Vim05], and ReAct [React08]. Within each pair, the left image is the ImageNet-OOD image and the right image is its closest image in ImageNet-1K. These examples illustrate the diversity of ImageNet-OOD and its visual similarity to ImageNet-1K despite having different semantics and OOD scores.
  • Figure 3: Influence of Covariate Shift on OOD Detection. Left: Relationship between OOD detection performance and the average distance to the closest ImageNet-1K [ILSVRC07] image using features from self-supervised models trained on the PASS [pass60] dataset. Results reveal that, given similar PASS feature distances between subsets of the two datasets, modern OOD detection algorithms elicit a stronger response to covariate shift (ImageNet-R [imagenetr49]) than to semantic shift (ImageNet-OOD). Right: An image of an ostrich from the ImageNet-1K dataset to which an elementary zoom transformation is applied. The transformation did not influence the model prediction, but substantially decreased the ranking of the ViM [Vim05] and ReAct [React08] scores in ImageNet-OOD by 38.4% and 39.6%, respectively.
  • Figure 4: Performance of OOD detection under random models. Five ResNet-50 models (indicated by color) with random parameters were evaluated on ImageNet-R (IN-R), ImageNet-C (IN-C) and ImageNet-OOD (IN-OOD).
  • Figure 5: Comparison of the ranking between MSP and Max-Logit. Left: MSP is slightly better at ranking correctly predicted ImageNet images higher. Center: Max-Logit ranks more incorrect ImageNet images higher than MSP. Right: MSP and Max-Logit have nearly identical ranks on ImageNet-OOD examples.
  • ...and 8 more figures
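
The percentile-based selection described in the Figure 2 caption can be sketched as follows: for each OOD feature vector, find the distance to its closest in-distribution feature vector, then pick the samples whose distance lies nearest to chosen percentiles. This is a hedged illustration only. The captions do not state the distance metric, so Euclidean distance is an assumption here, and the function names and toy feature arrays are hypothetical stand-ins for features extracted by a self-supervised model.

```python
import numpy as np

def nearest_id_distance(ood_feats, id_feats):
    """Euclidean distance (assumed metric) from each OOD feature to its closest ID feature."""
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        (ood_feats ** 2).sum(axis=1)[:, None]
        - 2.0 * ood_feats @ id_feats.T
        + (id_feats ** 2).sum(axis=1)[None, :]
    )
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    nn_idx = d2.argmin(axis=1)  # index of the closest ID sample
    nn_dist = np.sqrt(d2[np.arange(len(ood_feats)), nn_idx])
    return nn_dist, nn_idx

def indices_at_percentiles(values, percentiles=(10, 30, 50, 70, 90)):
    """Index of the sample whose value lies closest to each requested percentile."""
    targets = np.percentile(values, percentiles)
    return [int(np.abs(values - t).argmin()) for t in targets]

# Hypothetical 2-D features; real features would be high-dimensional embeddings.
id_feats = np.array([[0.0, 0.0], [10.0, 0.0]])
ood_feats = np.array([[1.0, 0.0], [0.0, 3.0], [10.0, 5.0]])

dist, nn = nearest_id_distance(ood_feats, id_feats)
picked = indices_at_percentiles(dist, percentiles=(50,))
print(dist, nn, picked)
```

The same `indices_at_percentiles` helper could be applied to detector scores instead of distances, which matches the caption's second ranking criterion.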