Toward a Realistic Benchmark for Out-of-Distribution Detection
Pietro Recalcati, Fabio Garcea, Luca Piano, Fabrizio Lamberti, Lia Morra
TL;DR
This work addresses the gap between traditional OOD benchmarks and real-world open-world image tagging by proposing a semantic-affinity-based benchmark that uses Places365 as the in-distribution dataset and draws out-of-distribution samples from ImageNet. It formalizes the problem, compares multiple OOD scoring techniques (MSP, MLV, ODIN, OODL), and evaluates them on datasets designed to span far- and near-OOD conditions, including WordNet-based ImageNet variants and FACETS-derived sets. The findings show that benchmark design strongly influences method rankings: confidence-based scores offer advantages on near-OOD samples, while more complex benchmarks challenge all methods and reveal biases in their predictions. The proposed benchmarks provide a more realistic platform for evaluating OOD detectors, guiding future research toward robust open-world recognition in high-resolution, semantically rich settings.
Abstract
Deep neural networks are increasingly used in a wide range of technologies and services, but remain highly susceptible to out-of-distribution (OOD) samples, that is, samples drawn from a distribution different from that of the original training set. A common approach to addressing this issue is to endow deep neural networks with the ability to detect OOD samples. Several benchmarks have been proposed to design and validate OOD detection techniques. However, many of them are based on far-OOD samples drawn from very different distributions, and thus lack the complexity needed to capture the nuances of real-world scenarios. In this work, we introduce a comprehensive benchmark for OOD detection, based on ImageNet and Places365, that assigns individual classes as in-distribution or out-of-distribution depending on their semantic similarity with the training set. Several techniques can be used to determine which classes should be considered in-distribution, yielding benchmarks with varying properties. Experimental results on different OOD detection techniques show how their measured efficacy depends on the selected benchmark and how confidence-based techniques may outperform classifier-based ones on near-OOD samples.
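To make the notion of a confidence-based OOD score concrete, the following minimal sketch computes two such scores from classifier logits: the maximum softmax probability (MSP) and the maximum logit. We assume here that MLV denotes the maximum logit value; the function names, the toy logits, and the threshold of 0.5 are hypothetical and chosen for illustration only, and are not taken from the experimental setup of this work.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability: higher means more in-distribution-like."""
    return softmax(logits).max(axis=1)


def max_logit_score(logits: np.ndarray) -> np.ndarray:
    """Maximum raw logit (assumed here to be what MLV refers to)."""
    return logits.max(axis=1)


def flag_ood(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag a sample as OOD when its confidence score falls below the threshold."""
    return scores < threshold


# Toy logits for two samples over three in-distribution classes (illustrative values).
logits = np.array([
    [8.0, 1.0, 0.5],   # one dominant class: confident prediction
    [1.2, 1.0, 0.9],   # nearly flat logits: low-confidence prediction
])

msp = msp_score(logits)
mlv = max_logit_score(logits)
is_ood = flag_ood(msp, threshold=0.5)  # threshold is an arbitrary illustrative choice
```

In practice the threshold would be selected on held-out data, or avoided altogether by reporting threshold-free metrics such as AUROC.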
