A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?
Galadrielle Humblot-Renaux, Sergio Escalera, Thomas B. Moeslund
TL;DR
This paper investigates how robust post-hoc out-of-distribution (OOD) detectors are when the underlying image classifier is trained on noisy labels. It benchmarks 20 detectors across 396 classifiers on 22 in-distribution (ID) datasets with real and synthetic label noise, and evaluates them on 7 diverse OOD datasets. The results show that label noise degrades OOD detection performance and that many detectors fail to separate ID misclassifications from OOD samples. Distance-based, feature-space methods such as GRAM and MDSEnsemble are comparatively resilient, whereas logit-based approaches are more sensitive to label noise; the relationship between ID accuracy and OOD detection performance varies by method. The work highlights practical considerations for evaluating OOD detectors under imperfect supervision and points toward noise-aware OOD methods for safer real-world deployment.
Abstract
The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification, the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been growing research interest in developing post-hoc OOD detection methods, there has been comparatively little discussion of how these methods perform when the underlying classifier is not trained on a clean, carefully curated dataset. In this work, we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets, noise types and levels, architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection, and show that poor separation between incorrectly classified ID samples and OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise
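
To make the notion of "separation between incorrectly classified ID samples and OOD samples" concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes a simple logit-based post-hoc score (maximum softmax probability) and assumes AUROC as the separation metric; neither choice is specified in the abstract. All function names and the toy logits are hypothetical and chosen only for illustration.

```python
import math


def msp_score(logits):
    """Maximum softmax probability; higher means 'looks more in-distribution'.

    Assumption: MSP stands in for a generic logit-based post-hoc detector.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a sketch.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def evaluate(id_logits, id_labels, ood_logits):
    """Compare OOD separation for all, correct-only and incorrect-only ID samples."""
    id_scores = [msp_score(z) for z in id_logits]
    ood_scores = [msp_score(z) for z in ood_logits]
    preds = [max(range(len(z)), key=z.__getitem__) for z in id_logits]

    correct = [s for s, p, y in zip(id_scores, preds, id_labels) if p == y]
    incorrect = [s for s, p, y in zip(id_scores, preds, id_labels) if p != y]

    return {
        "all_id_vs_ood": auroc(id_scores, ood_scores),
        "correct_id_vs_ood": auroc(correct, ood_scores),
        "incorrect_id_vs_ood": auroc(incorrect, ood_scores),
    }


# Made-up toy logits (3 classes); these are NOT results from the paper.
id_logits = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.2, 0.0]]
id_labels = [0, 1, 1, 2]  # the last two samples are misclassified
ood_logits = [[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]]

results = evaluate(id_logits, id_labels, ood_logits)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup, the misclassified ID samples receive low confidence scores and are therefore harder to tell apart from the OOD samples than the correctly classified ones. Reporting the three numbers separately is one way such an overlap could be surfaced rather than hidden inside a single aggregate score.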
