Detecting and Rectifying Noisy Labels: A Similarity-based Approach

Dang Huu-Tien, Minh-Phuong Nguyen, Naoya Inoue

TL;DR

This paper addresses label noise in large datasets by introducing post-hoc, model-agnostic noise detection and rectification using penultimate-layer features. It provides a theoretical justification showing that mislabeled points tend to be more similar to their true class in the penultimate space, and proposes two similarity-based methods, Sim-Cos and Sim-Dot, to detect and correct noisy labels with an auxiliary dataset. Empirical results on Snippets and IMDB demonstrate that the similarity-based approaches outperform confidence- and gradient-based baselines across several noise scenarios and model architectures, with robust performance as the auxiliary data size grows and as more neighbors are considered. The work offers practical tools for automatic dataset cleaning that can improve downstream generalization, while noting sensitivity to hyperparameters and auxiliary data choices, and providing an implementation link for reproduction.

Abstract

Label noise in datasets can significantly damage the performance and robustness of deep neural networks (DNNs) trained on them. As the size of modern DNNs grows, there is a growing demand for automated tools for detecting such errors. In this paper, we propose post-hoc, model-agnostic noise detection and rectification methods that use the penultimate-layer features of a DNN. Our idea is based on the observation that the penultimate feature of a mislabeled data point is more similar to data points of its true class than to data points of other classes, which makes the probability of a label occurring within a tight cluster of similar points informative for detecting and rectifying errors. Through theoretical and empirical analyses, we demonstrate that our approach achieves high detection performance across diverse, realistic noise scenarios and can automatically rectify these errors to improve dataset quality. Our implementation is available at https://anonymous.4open.science/r/noise-detection-and-rectification-AD8E.
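
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the link above for that). It assumes a k-nearest-neighbor vote over an auxiliary set of feature vectors with trusted labels: a point is flagged when the majority label among its k most similar auxiliary points differs from its given label, and it is relabeled only when that majority holds a share of at least tau. The function names, the voting rule, and the way `k` and `tau` are used are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two feature vectors (Sim-Cos style)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def dot(u, v):
    """Plain dot product between two feature vectors (Sim-Dot style)."""
    return sum(a * b for a, b in zip(u, v))


def detect_and_rectify(feature, label, aux_features, aux_labels,
                       k=5, tau=0.8, sim=cosine):
    """Hypothetical helper: flag and optionally relabel one data point.

    feature      -- penultimate-layer feature of the point under inspection
    label        -- its (possibly noisy) label
    aux_features -- features of the auxiliary points
    aux_labels   -- labels of the auxiliary points (assumed trustworthy)
    Returns (is_noisy, new_label).
    """
    # Rank auxiliary points by similarity to the query, most similar first.
    ranked = sorted(range(len(aux_features)),
                    key=lambda i: sim(feature, aux_features[i]),
                    reverse=True)
    top = ranked[:k]

    # Empirical label distribution among the k nearest auxiliary points.
    votes = Counter(aux_labels[i] for i in top)
    majority, count = votes.most_common(1)[0]
    share = count / len(top)

    # Assumed rule: disagreement with the neighborhood majority flags noise;
    # the label is rewritten only if the majority share reaches tau.
    is_noisy = majority != label
    new_label = majority if is_noisy and share >= tau else label
    return is_noisy, new_label


# Toy auxiliary set: class 0 lies along the first axis, class 1 along the second.
aux_features = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
aux_labels = [0, 0, 0, 1, 1, 1]

# A point that sits with class 0 but carries label 1 is flagged and relabeled.
flagged = detect_and_rectify([0.95, 0.05], 1, aux_features, aux_labels, k=3)
# A point whose label agrees with its neighborhood is left untouched.
clean = detect_and_rectify([0.1, 0.95], 1, aux_features, aux_labels, k=3)
```

Passing `sim=dot` switches the sketch from cosine similarity to the dot product. Raising `tau` makes relabeling more conservative without changing which points are flagged.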

Paper Structure

This paper contains 18 sections, 20 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Distribution of (a) cosine similarity and (b) dot product over IMDB with $10\%$ noise. Blue bars represent the similarity between mislabeled data points and their true class data points; red bars represent the similarity between mislabeled data points and other class data points. Features are obtained from a trained BERT.
  • Figure 2: Noise detection accuracy of methods measured on Snippets (Fig. 2a-c) and IMDB (Fig. 2d-f) across various sizes and noise types.
  • Figure 3: Detection accuracy of models on Snippets (left) and IMDB (right) across sizes of $\mathcal{D}_{\textnormal{aux}}$.
  • Figure 4: Detection accuracy of similarity-based methods on Snippets (left) and IMDB (right) for $k \in \{5,10,20,50,100,200\}$.
  • Figure 5: Noise reduction rate across threshold $\tau \in \{0.5,0.6,0.7,0.8, 0.9, 0.99\}$.