Optimized Gradient Clipping for Noisy Label Learning

Xichen Ye, Yifan Wu, Weizhong Zhang, Xiaoqiang Li, Yifan Chen, Cheng Jin

TL;DR

Optimized Gradient Clipping (OGC) tackles the instability of learning with noisy labels by dynamically adjusting the gradient clipping threshold $\tau^{(t)}$ at each training step. It models the cross-entropy losses of clean and noisy samples with a two-component Gaussian Mixture Model and constrains the noise-to-clean gradient ratio after clipping. The threshold selection is posed as a formal optimization problem, and the clipped gradient is mapped to a transformed loss such as $\bar{\ell}_\text{CE}$. The approach comes with theoretical noise-tolerance guarantees and reports superior performance across symmetric, asymmetric, instance-dependent, and real-world label noise, including substantial gains when combined with robust losses. The method remains efficient, adding modest overhead while providing strong robustness on large-scale noisy datasets such as WebVision.

Abstract

Previous research has shown that constraining the gradient of the loss function with respect to model-predicted probabilities can enhance model robustness against noisy labels. These methods typically specify a fixed optimal threshold for gradient clipping, chosen on validation data, to obtain the desired robustness against noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisy-labeled samples at different stages of training, significantly limiting the model's capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This approach allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide statistical analysis to certify the noise-tolerance ability of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach.
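
The thresholding idea described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes that the gradient of the cross-entropy loss $\ell = -\log p$ with respect to the predicted probability $p$ of the labeled class has magnitude $1/p = e^{\ell}$, fits a two-component 1-D Gaussian mixture to per-sample losses with a plain EM loop, treats the higher-mean component as the noisy one, and grid-searches for the largest threshold whose clipped noise-to-clean gradient ratio stays within a budget `eps`. The function names, the grid search, and the parameters `tau_max` and `steps` are hypothetical choices made for illustration only.

```python
import math
import random


def responsibilities(losses, weights, means, stds):
    """Posterior probability of each mixture component for every loss value."""
    out = []
    for x in losses:
        dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, stds)]
        total = sum(dens)
        out.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
    return out


def fit_two_gaussians(losses, iters=50):
    """Fit a 1-D mixture of two Gaussians with EM; returns (weights, means, stds)."""
    lo, hi = min(losses), max(losses)
    spread = max(hi - lo, 1e-3)
    means = [lo + 0.25 * spread, lo + 0.75 * spread]
    stds = [spread / 4.0, spread / 4.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = responsibilities(losses, weights, means, stds)  # E-step
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, stds


def noisy_posteriors(losses, weights, means, stds):
    """Assumption: the component with the larger mean loss models noisy labels."""
    noisy = 0 if means[0] > means[1] else 1
    return [r[noisy] for r in responsibilities(losses, weights, means, stds)]


def clipped_ratio(losses, p_noisy, tau):
    """Noise-to-clean ratio of clipped gradient magnitudes min(exp(loss), tau)."""
    noisy = clean = 0.0
    for x, p in zip(losses, p_noisy):
        g = min(math.exp(min(x, 50.0)), tau)  # cap the exponent to avoid overflow
        noisy += p * g
        clean += (1.0 - p) * g
    return noisy / max(clean, 1e-12)


def choose_threshold(losses, p_noisy, eps, tau_max=20.0, steps=200):
    """Largest grid value tau in [1, tau_max] with clipped_ratio <= eps, else 1.0."""
    best = 1.0
    for i in range(steps + 1):
        tau = 1.0 + i * (tau_max - 1.0) / steps
        if clipped_ratio(losses, p_noisy, tau) <= eps:
            best = tau
    return best


# Synthetic per-sample losses: 80% "clean" (small loss), 20% "noisy" (large loss).
random.seed(0)
losses = [abs(random.gauss(0.3, 0.1)) for _ in range(800)]
losses += [abs(random.gauss(3.0, 0.5)) for _ in range(200)]
weights, means, stds = fit_two_gaussians(losses)
p_noisy = noisy_posteriors(losses, weights, means, stds)
tau = choose_threshold(losses, p_noisy, eps=0.5)
print(f"estimated means={means}, chosen tau={tau:.2f}")
```

In a training loop, one would refit the mixture on recent per-sample losses and recompute `tau` periodically, which mirrors the dynamic behaviour the abstract describes. How often to refit and which samples to use are left open in this sketch.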

Paper Structure

This paper contains 26 sections, 4 theorems, 42 equations, 4 figures, 7 tables, and 1 algorithm.

Key Result

Proposition 1

Given any classifier $f$, for any input sample pair $(\bm{x}, y)$ and any $\tau^{(t)} \ge 1$, the CE+OGC loss $\bar{\ell}_\text{CE}$ is bounded both below and above. Moreover, the sum of $\bar{\ell}_\text{CE}$ over all classes is therefore also bounded below and above.

Figures (4)

  • Figure 1: KDE visualizations of gradient distributions for clean and noisy labels, along with decision boundary visualizations for a simple binary classification task that utilizes gradient clipping with a fixed threshold. The leftmost plot in the first row shows the raw training data. The subsequent plots illustrate shifts in gradient distributions (before clipping) at various training epochs (50, 500, 1000, and 1500). The second row displays the corresponding decision boundaries for each epoch.
  • Figure 2: Test accuracies (%) for different values of $\epsilon$ on the CIFAR-10 dataset under different label-noise settings.
  • Figure 3: The effect of queue size $q$.
  • Figure 4: The effect of interval $s$.

Theorems & Definitions (8)

  • Proposition 1
  • Theorem 1: Excess risk under instance-independent symmetric label noise
  • Theorem 2: Excess risk under instance-independent asymmetric label noise
  • Theorem 3: Excess risk under instance-dependent label noise
  • proof
  • proof
  • proof
  • proof