A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?

Galadrielle Humblot-Renaux; Sergio Escalera; Thomas B. Moeslund

TL;DR

This paper investigates how robust post-hoc OOD detectors are when the underlying image classifier is trained on noisy labels. It benchmarks 20 detectors across 396 classifiers on 22 ID datasets with real and synthetic label noise, and evaluates them on 7 diverse OOD datasets. The results show that label noise degrades OOD detection performance and that many detectors fail to separate ID misclassifications from OOD samples. Distance-based, feature-space methods such as GRAM and MDSEnsemble are comparatively resilient, whereas logit-based approaches are more sensitive to label noise; the relationship between ID accuracy and OOD detection performance is nuanced and varies by method. The work highlights practical considerations for evaluating OOD detectors under imperfect supervision and points toward noise-aware OOD detection methods for safer real-world deployment.

Abstract

The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification, the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been growing research interest in developing post-hoc OOD detection methods, there has been comparatively little discussion of how these methods perform when the underlying classifier is not trained on a clean, carefully curated dataset. In this work, we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets, noise types & levels, architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection, and show that poor separation between incorrectly classified ID samples and OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise
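To make the "incorrectly classified ID vs. OOD" comparison concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the logits and labels are made-up toy values, the helper names (`auroc`, `max_logit_score`, `predicted_class`) are hypothetical, and the maximum logit is used only as one example of a logit-based post-hoc score. ID samples are treated as the positive class, so a higher score should mean "more in-distribution".

```python
from typing import List, Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def max_logit_score(logits: Sequence[float]) -> float:
    """Logit-based OOD score: the largest logit (higher = more ID-like)."""
    return max(logits)


def predicted_class(logits: Sequence[float]) -> int:
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy values, invented for illustration only.
id_logits: List[List[float]] = [
    [4.0, 0.5, 0.1],  # confident and correct
    [3.5, 1.0, 0.2],  # confident and correct
    [1.2, 1.0, 0.9],  # unsure and wrong (true class is 1)
    [0.8, 1.1, 0.7],  # unsure and wrong (true class is 2)
]
id_labels = [0, 0, 1, 2]
ood_logits: List[List[float]] = [[1.0, 0.9, 1.1], [1.3, 0.2, 0.4], [0.6, 0.9, 0.5]]

id_scores = [max_logit_score(z) for z in id_logits]
ood_scores = [max_logit_score(z) for z in ood_logits]

# Keep only the ID samples that the classifier got wrong.
incorrect_id_scores = [
    s for s, z, y in zip(id_scores, id_logits, id_labels) if predicted_class(z) != y
]

auroc_all = auroc(id_scores, ood_scores)                  # all ID vs. OOD
auroc_incorrect = auroc(incorrect_id_scores, ood_scores)  # incorrect ID vs. OOD

print(f"AUROC (all ID vs. OOD):       {auroc_all:.3f}")
print(f"AUROC (incorrect ID vs. OOD): {auroc_incorrect:.3f}")
```

In this toy setup the misclassified ID samples receive scores close to those of the OOD samples, so the second AUROC is noticeably lower than the first. That is the kind of gap the abstract refers to, though the numbers here carry no empirical meaning.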

Paper Structure (13 sections, 7 figures, 2 tables)

Introduction
Problem set-up
Related work
OOD detection methods
Experiments
Analysis
Where there's noise there's trouble
Does accuracy tell the whole story?
Design features which hurt or help
Let's not forget about the validation set
What about a more realistic setting?
Zooming out
Acknowledgements

Figures (7)

Figure 1: Can state-of-the-art OOD detectors tell incorrectly classified ID images apart from OOD inputs? Not really. Here we compare their performance across 396 trained classifiers.
Figure 2: Distribution of OOD detection performance across methods & models when training the classifier on different label sets.
Figure 3: Does OOD detection performance (AUROC$_{\text{ID vs. OOD}}$) correlate with ID classification performance (accuracy)? We measure the rank correlation across different architectures, seeds, checkpoints, and datasets for different label sets. All results shown here are statistically significant ($p \ll 0.001$).
Figure 4: Relationship between ID classification performance and OOD detection performance, considering all ID test samples (top) or only incorrectly classified ones (bottom) in the AUROC metric. Each point corresponds to a single model.
Figure 5: Max Logit ID and OOD score statistics across models trained on Clothing1M, for different noise types & checkpointing.
...and 2 more figures
