On the Role of Label Noise in the Feature Learning Process

Andi Han, Wei Huang, Zhanpeng Zhou, Gang Niu, Wuyang Chen, Junchi Yan, Akiko Takeda, Taiji Suzuki

TL;DR

This paper analyzes how label noise shapes feature learning by modeling a signal-noise data distribution and studying a two-layer CNN trained with gradient descent under logistic loss. It uncovers a two-stage learning dynamic: Stage I where the network fits clean samples and learns the signal, and Stage II where training converges and memorizes noisy labels, reducing generalization. The authors provide theoretical support for early stopping and small-loss sample selection, and validate the theory with synthetic and real-data experiments, including SHAP-based interpretability. The results advance understanding of label-noise robustness in nonlinear feature learning and suggest practical guidelines for training with noisy labels.

Abstract

Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.
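
The signal-noise data distribution with label noise described in the abstract can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' exact setup: the function name `make_signal_noise_data`, the choice of a signal vector along the first coordinate, the Gaussian noise kept orthogonal to the signal, and independent label flips with probability `tau` are all assumptions made here for concreteness. The symbols `mu`, `sigma_xi`, `d`, `n`, and `tau` follow the notation that appears in Theorem 4.1 and the figure captions.

```python
import numpy as np

def make_signal_noise_data(n, d, mu_norm, sigma_xi, tau, seed=0):
    """Sample n points, each with a label-dependent signal patch and a
    label-independent noise patch, then flip each label with probability tau.

    All modeling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Assumed signal direction: first coordinate, with norm mu_norm.
    mu = np.zeros(d)
    mu[0] = mu_norm

    # True labels in {-1, +1}.
    y = rng.choice([-1, 1], size=n)

    # Signal patch: y_i * mu (depends on the true label).
    signal = y[:, None] * mu[None, :]

    # Noise patch: Gaussian with standard deviation sigma_xi,
    # projected to be orthogonal to mu (assumption).
    xi = sigma_xi * rng.standard_normal((n, d))
    xi[:, 0] = 0.0

    # Observed labels: each true label is flipped independently with prob. tau.
    flip = rng.random(n) < tau
    y_tilde = np.where(flip, -y, y)

    return signal, xi, y, y_tilde, mu
```

A sample is "clean" when `y[i] == y_tilde[i]` and "noisy" otherwise, which matches the distinction drawn in the abstract between samples with and without label noise.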

Paper Structure

This paper contains 26 sections, 28 theorems, 140 equations, 4 figures, 2 tables.

Key Result

Theorem 4.1

Under Condition ass:main, there exists $T_1 = \Theta (\eta^{-1} nm \sigma_\xi^{-2}d^{-1} )$ such that ${\overline{\rho}}_{\tilde{y}_i, r, i}^{(T_1)} = \Theta(1)$ for all $i \in [n]$, $r \in [m]$ with $\langle {\mathbf w}^{(0)}_{\tilde{y}_i, r}, {\boldsymbol{\xi}}_i \rangle \geq 0$ and $\gamma_{j,r}^

Figures (4)

  • Figure 1: Experimental validation under the synthetic setup, with label noise (left) and without label noise (right). (Top) The change in $\max_{j,r} \gamma_{j,r}$ (signal learning) and $\max_{j,r} \rho_{j,r,i}$ (noise memorization) on noisy (i.e., when $y_i \neq \tilde{y}_i$) and clean samples (i.e., when $y_i = \tilde{y}_i$) w.r.t. the training iteration $t$. (Bottom) The change in overall training accuracy $Acc_{\mathcal{D}_{\textrm{train}}}$, as well as the accuracy on clean $Acc_{\mathcal{D}_{\textrm{train, clean}}}$ and noisy samples $Acc_{\mathcal{D}_{\textrm{train, noisy}}}$, w.r.t. the training iteration $t$ for models under different settings. Note that there are no noisy samples when training without label noise; thus we only plot noise memorization on clean samples and the overall training accuracy. The gray dashed line separates the two stages for training with label noise. More experimental results are in the supplementary material.
  • Figure 2: Experimental validation in real-world scenarios. Two VGG nets are trained on the first two categories of CIFAR-10 under nearly identical settings. One is trained with label noise and the other without. (Left) The accuracy curves for the two models. Here, $Acc_{\mathcal{D}_{\textrm{train}}}$ and $Acc_{\mathcal{D}_{\textrm{test}}}$ represent the accuracy on the entire training and test sets, respectively, while $Acc_{\mathcal{D}_{\textrm{train, clean}}}$ and $Acc_{\mathcal{D}_{\textrm{train, noisy}}}$ specifically denote the accuracy on clean and noisy samples from the training set. (Right) Visualization of model predictions (via SHAP; Lundberg and Lee, 2017) for noisy samples across multiple epochs. Red regions indicate positive contributions to model predictions, while blue regions denote negative contributions, with darker regions signifying greater contributions. More experimental results are in the supplementary material.
  • Figure 3: Experiments on synthetic data with varying problem settings, including varying signal strength $\mu$ and label noise ratio $\tau$. The blue shaded area marks the period before noise learning overtakes signal learning on noisy samples; this corresponds to Stage I in our analysis, where early stopping is beneficial. The orange shaded area marks the period where noise learning exceeds signal learning on noisy samples, which corresponds to Stage II in our analysis.
  • Figure 4: Experiments on the CIFAR-10 dataset with varying label noise ratio $\tau$. Across different label noise ratios, we observe a similar pattern: the training accuracy on noisy samples initially decreases before increasing to perfect classification. This validates our theoretical findings in real-world settings under various label noise ratios.
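
The accuracy curves in the figure captions above track three quantities: overall training accuracy, accuracy on clean samples, and accuracy on noisy samples. A minimal sketch of how such a split could be computed is given below. The helper name `split_train_accuracy` is hypothetical, and the sketch assumes that accuracy is measured against the observed (possibly flipped) training labels, so that high accuracy on the noisy subset indicates memorization of the flipped labels.

```python
import numpy as np

def split_train_accuracy(pred, y_true, y_obs):
    """Return training accuracy overall, on clean samples, and on noisy samples.

    pred   : predicted labels in {-1, +1}
    y_true : true labels (before label noise)
    y_obs  : observed training labels (after label noise)

    Accuracy is measured against y_obs (an assumption of this sketch).
    """
    pred = np.asarray(pred)
    y_true = np.asarray(y_true)
    y_obs = np.asarray(y_obs)

    noisy = y_true != y_obs   # samples whose label was flipped
    clean = ~noisy            # samples whose label is intact
    correct = pred == y_obs

    def masked_mean(mask):
        # NaN when the subset is empty, e.g. training without label noise.
        return float(np.mean(correct[mask])) if mask.any() else float("nan")

    return {
        "train": float(np.mean(correct)),
        "clean": masked_mean(clean),
        "noisy": masked_mean(noisy),
    }
```

Evaluating this helper at every iteration or epoch would give curves comparable in spirit to those described in the captions; a dip in the `"noisy"` value followed by a rise to 1.0 is the pattern the captions attribute to the transition between the two stages.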

Theorems & Definitions (46)

  • Theorem 4.1
  • Theorem 4.2
  • Proposition 4.3: Early stopping and sample selection
  • Theorem 4.4
  • Lemma 5.1
  • Proposition 5.2
  • Lemma 5.3
  • Lemma 2.1: Cao et al. (2022); Kou et al. (2023)
  • Lemma 2.2: Cao et al. (2022); Kou et al. (2023)
  • Lemma 2.3: Kou et al. (2023)
  • ...and 36 more
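
Proposition 4.3 in the list above concerns early stopping and sample selection, and the TL;DR refers to small-loss sample selection specifically. The sketch below shows one common form of that idea: compute the per-sample logistic loss against the observed labels and keep the fraction of samples with the smallest loss. The function names and the `keep_frac` parameter are hypothetical, and this is a generic illustration rather than the criterion analyzed in the paper.

```python
import numpy as np

def logistic_loss(logits, labels):
    """Per-sample logistic loss log(1 + exp(-y * f)), computed stably."""
    return np.logaddexp(0.0, -labels * logits)

def select_small_loss(logits, labels, keep_frac):
    """Return sorted indices of the keep_frac fraction of samples
    with the smallest logistic loss against the observed labels.

    keep_frac is a hypothetical knob; in practice it might be set
    from an estimate of the label noise ratio.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    losses = logistic_loss(logits, labels)

    # Keep at least one sample.
    k = max(1, int(round(keep_frac * len(losses))))

    # Indices of the k smallest losses, returned in ascending index order.
    return np.sort(np.argsort(losses)[:k])
```

The connection to the two-stage picture is that, at the end of Stage I, the abstract states the model fits clean samples while ignoring noisy ones, so samples with flipped labels would be expected to carry larger loss at that point. Applying the selection late in training, after the noisy labels have been memorized, would not separate the two groups in the same way.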