Toward a Realistic Benchmark for Out-of-Distribution Detection

Pietro Recalcati, Fabio Garcea, Luca Piano, Fabrizio Lamberti, Lia Morra

TL;DR

This work addresses the gap between traditional OOD benchmarks and real-world open-world image tagging by proposing a semantic-affinity-based benchmark that uses Places365 as the in-distribution dataset and draws out-of-distribution samples from ImageNet. It formalizes the problem, compares multiple OOD scoring techniques (MSP, MLV, ODIN, OODL), and evaluates them on datasets designed to span far- and near-OOD conditions, including WordNet-based ImageNet variants and FACETS-derived sets. The findings show that benchmark design strongly influences method rankings: confidence-based scores offer advantages on near-OOD samples, while more complex benchmarks challenge all methods and reveal biases in predictions. The proposed benchmarks provide a more realistic platform for evaluating OOD detectors, guiding future research toward robust open-world recognition in high-resolution, semantically rich settings.

Abstract

Deep neural networks are increasingly used in a wide range of technologies and services, but remain highly susceptible to out-of-distribution (OOD) samples, that is, samples drawn from a different distribution than the original training set. A common approach to address this issue is to endow deep neural networks with the ability to detect OOD samples. Several benchmarks have been proposed to design and validate OOD detection techniques. However, many of them are based on far-OOD samples drawn from very different distributions, and thus lack the complexity needed to capture the nuances of real-world scenarios. In this work, we introduce a comprehensive benchmark for OOD detection, based on ImageNet and Places365, that assigns individual classes as in-distribution or out-of-distribution depending on their semantic similarity with the training set. Several techniques can be used to determine which classes should be considered in-distribution, yielding benchmarks with varying properties. Experimental results on different OOD detection techniques show how their measured efficacy depends on the selected benchmark and how confidence-based techniques may outperform classifier-based ones on near-OOD samples.
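
To make the notion of a confidence-based score concrete, the following is a minimal sketch of two such scores computed from a classifier's output logits. It assumes that MSP denotes the maximum softmax probability and that MLV denotes the maximum logit value; the function names, the example logits, and the threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability: higher means more ID-like."""
    return max(softmax(logits))


def max_logit_score(logits):
    """Maximum raw logit (assumed meaning of MLV): higher means more ID-like."""
    return max(logits)


def flag_ood(score, threshold):
    """Flag a sample as OOD when its confidence score falls below the threshold."""
    return score < threshold


# Hypothetical logits over three in-distribution classes.
confident = [8.0, 1.0, 0.5]   # one class clearly dominates
ambiguous = [2.0, 1.9, 1.8]   # no class stands out

for name, logits in [("confident", confident), ("ambiguous", ambiguous)]:
    msp = msp_score(logits)
    mlv = max_logit_score(logits)
    print(name, round(msp, 3), mlv, flag_ood(msp, threshold=0.5))
```

In practice the threshold would be tuned on a validation split; the value 0.5 here is arbitrary. Both scores need only a single forward pass of the pretrained classifier.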

Paper Structure

This paper contains 12 sections, 7 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Examples of predictions generated by a classifier pretrained on Places365 (right) on samples drawn from the ImageNet dataset (left). The predicted classes, although different from the ImageNet labels, are highly semantically correlated.
  • Figure 2: Average semantic similarity between the ground truth class and the predicted ID labels, as computed on the validation split of the FACETS OOD Detection T1. SUN ground truth labels (sun prefix) were generally more semantically similar to the predicted ID class than ImageNet ground truth labels (in prefix).
  • Figure 3: Average semantic similarity between the OOD ground truth class and the predicted ID labels, as computed on the validation split of the FACETS OOD Detection T1. Man-made environments seem to be less ambiguous, and predicted classes are more likely to be semantically similar to the respective OOD ground truth class.
  • Figure 4: Examples of strong edges between classes representing the same concepts with slightly different names. The underscore prevented shoe shop and home theater from being paired with their counterparts, whereas differences in wording or spelling were responsible for the mismatches in the case of cubicle/office and carrousel.