Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks
Hongjun Wang, Sagar Vaze, Kai Han
TL;DR
The paper analyzes out-of-distribution (OOD) detection and open-set recognition (OSR) as two related distribution-shift tasks. It finds that method performance is strongly correlated across the two fields, and it introduces a large-scale benchmark that evaluates semantic shift (OSR) and covariate shift (OOD) separately. Magnitude-aware scoring rules, notably the maximum logit score (MLS) and Energy, prove robust across settings, whereas Outlier Exposure (OE) saturates on standard benchmarks but struggles to scale when distribution gaps widen. The authors show that the efficacy of OE depends critically on how similar the auxiliary OOD data is to the test OOD distribution, and they introduce the Outlier-Aware Accuracy metric to capture robustness–detection trade-offs. The benchmark and findings offer practical guidance for evaluating and developing methods under real-world distribution shifts and lay a foundation for future unified studies of OOD detection and OSR.
Abstract
Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between method performance on the two tasks; (ii) We propose a new, large-scale benchmark setting that we suggest better disentangles the problems tackled by OOD detection and OSR, and we re-evaluate state-of-the-art OOD detection and OSR methods in this setting; (iii) Surprisingly, we find that the best-performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules that are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR
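
To make the notion of a magnitude-sensitive scoring rule concrete, the sketch below computes three common scores from a single vector of classifier logits: maximum softmax probability (MSP), maximum logit score (MLS), and the Energy score. It is illustrative only and is not taken from the authors' repository. The function names, the `temperature` parameter, and the example logits are assumptions made for this example, and the convention that a higher score means "more in-distribution" is a choice made here.

```python
import math


def max_softmax_prob(logits):
    """MSP: the largest softmax probability.

    Softmax normalizes the logits, so adding the same constant to
    every logit leaves this score unchanged.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract max for stability
    return max(exps) / sum(exps)


def max_logit_score(logits):
    """MLS: the largest raw logit, with no normalization."""
    return max(logits)


def energy_score(logits, temperature=1.0):
    """Energy score: T * logsumexp(logits / T).

    This is the negative free energy, so a higher value is read here
    as 'more in-distribution' (an assumed convention).
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


if __name__ == "__main__":
    # Hypothetical logits: `shifted` is `base` plus a constant offset of 5.
    base = [1.0, 0.0, 0.0]
    shifted = [z + 5.0 for z in base]
    for name, fn in [("MSP", max_softmax_prob),
                     ("MLS", max_logit_score),
                     ("Energy", energy_score)]:
        print(name, fn(base), fn(shifted))
```

In this toy case MSP is identical for both logit vectors, while MLS and Energy each increase by the offset. This is one simple way to see that softmax normalization can discard information about logit magnitude that the unnormalized scores retain; it does not by itself reproduce any result from the paper.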
