An accurate detection is not all you need to combat label noise in web-noisy datasets

Paul Albert; Jack Valmadre; Eric Arazo; Tarun Krishna; Noel E. O'Connor; Kevin McGuinness

TL;DR

The paper addresses learning from web-noisy image data by investigating the linear separability between in-distribution (ID) and out-of-distribution (OOD) samples in unsupervised contrastive representations. It shows that direct estimation of the OOD hyperplane improves detection but does not always boost classification, because the detection misses informative clean samples. To overcome this, it introduces Linear Separation Alternating (LSA), which blends linear-separation noise detection $W$ with a state-of-the-art small-loss detector $Z$ in an alternating schedule, yielding the $\text{PLS-LSA}$ framework, with an extended co-training variant $\text{PLS-LSA}^+$. Empirical results on CNWL, mini-Webvision, and Webly-fg indicate that pure noise-detection gains do not guarantee better accuracy, but the hybrid LSA approach yields state-of-the-art or near state-of-the-art classification in real-world web-noise settings, aided by ablations that highlight the importance of early-layer features, strong data augmentation, and co-training. The findings emphasize the value of combining complementary noise-detection signals and provide practical guidance for robust learning under web noise, including strategies for trusted-subset use and depth-aware linear separation. Overall, the work advances robust image classification in noisy web data by integrating unsupervised ID/OOD separation with supervised noise correction in a principled alternating framework.

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe a low correlation with SOTA metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a state-of-the-art (SOTA) small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. Code: github.com/PaulAlbert31/LSA
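
The alternation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the fixed loss threshold is only a stand-in for the criterion PLS actually uses, the hyperplane is given rather than estimated, and the even/odd epoch schedule is one possible way to alternate. The toy data is chosen so that the third sample has a small loss but falls on the OOD side of the hyperplane, which mirrors the kind of clean example the abstract says linear separation misses.

```python
from typing import List, Sequence, Tuple


def small_loss_mask(losses: Sequence[float], threshold: float) -> List[bool]:
    """Detector Z (stand-in): keep samples whose training loss is below a threshold."""
    return [loss < threshold for loss in losses]


def linear_separation_mask(
    features: Sequence[Tuple[float, ...]],
    weights: Tuple[float, ...],
    bias: float,
) -> List[bool]:
    """Detector W: keep samples on the positive (assumed ID) side of the hyperplane."""
    return [
        sum(w * x for w, x in zip(weights, feat)) + bias > 0.0
        for feat in features
    ]


def alternating_mask(epoch: int, z_mask: List[bool], w_mask: List[bool]) -> List[bool]:
    """Assumed schedule: use Z on even epochs and W on odd epochs."""
    return list(z_mask) if epoch % 2 == 0 else list(w_mask)


# Toy inputs (hypothetical): three samples with 2-D features and per-sample losses.
features = [(2.0, 1.0), (-1.5, 0.5), (-0.2, 0.1)]
losses = [0.1, 2.3, 0.2]

z_mask = small_loss_mask(losses, threshold=1.0)
w_mask = linear_separation_mask(features, weights=(1.0, 0.0), bias=0.0)

for epoch in range(4):
    # The clean subset used for supervised training changes from epoch to epoch.
    print(epoch, alternating_mask(epoch, z_mask, w_mask))
```

Under this schedule the third sample is treated as clean on the epochs that use the small-loss detector, so it still contributes to supervised training even though the linear separator rejects it.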

Paper Structure (43 sections, 3 equations, 13 figures, 14 tables)

Introduction
Related work
Detection and correction of incorrect labels
Out-of-distribution noise in web-noisy datasets
Unsupervised learning and label noise
Linear Separation Alternating (LSA)
Identifying OOD images in web-noisy datasets
Linear separation improves in deeper layers
Estimating the linear separator
Does better noise detection imply better classification?
Clean samples missed by the linear separation
Linear Separation Alternating
PLS-LSA
Semi-supervised imputation and Co-training
Experiments
...and 28 more sections

Figures (13)

Figure 1: Extending the work of [2022_ECCV_SNCF], we observe that for web noise (CNWL), ID and OOD samples become more separable in earlier representations in the network.
Figure 1: ROC for different noise-retrieval metrics. We report PLS (loss-based) and RRL (feature-based), the refined detection when they are used as a support set for the logistic regressor ($W_{PLS}$ and $W_{RRL}$ respectively), and results where trusted examples (100, 1k or 10k) are used for training the logistic regressor. Features are extracted after block 2 of a PreAct ResNet18.
Figure 2: Examples of clean samples missed by our linear separation $W_{PLS}$ but correctly recovered (green) by a small loss approach, here PLS. $20\%$ noise CNWL.
Figure 2: Clean samples missed by our linear separation but retrieved by PLS or RRL. PLS-LSA trained on the CNWL $20\%$.
Figure 3: Low correlation of our linear separation with the PLS and RRL metrics trained on CNWL with $20\%$ web noise. $W_{PLS/RRL}$ denotes using PLS or RRL for $\hat{\mathcal{T}}$.
...and 8 more figures
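
The captions above describe fitting a logistic regressor on a support set and scoring the resulting detection with a ROC curve. The following is a rough, self-contained sketch of that idea under stated assumptions: toy 2-D Gaussian features stand in for the PreAct ResNet18 features, the support labels are a synthetic imperfect detection with about 10% of labels flipped (not the output of PLS or RRL), and all function names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, lr=0.5, steps=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_toy_data(n=200, seed=0):
    """Toy features: ID samples (label 1) near (+1, +1), OOD samples (label 0) near (-1, -1)."""
    rng = random.Random(seed)
    feats, true = [], []
    for i in range(n):
        y = i % 2
        center = 1.0 if y == 1 else -1.0
        feats.append((rng.gauss(center, 0.5), rng.gauss(center, 0.5)))
        true.append(y)
    # Imperfect support labels: each true label is flipped with probability 0.1.
    support = [1 - y if rng.random() < 0.1 else y for y in true]
    return feats, true, support


feats, true_labels, support_labels = make_toy_data()
w, b = fit_logistic(feats, support_labels)  # train on the noisy support set
scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in feats]
auc = roc_auc(scores, true_labels)  # evaluate against the true ID/OOD labels
print(f"AUC of the refined linear detection: {auc:.3f}")
```

On this toy data the two clusters are well separated, so the fitted hyperplane recovers the ID/OOD split despite the noisy support labels; real features would of course be harder.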
