Robust Learning under Hybrid Noise

Yang Wei, Shuo Chen, Shanshan Ye, Bo Han, Chen Gong

TL;DR

Hybrid noise, arising from simultaneous feature and label corruption, is addressed by FLR, a unified data-recovery framework that recovers clean features and labels via a double low-rank formulation. The method couples low-rank feature-to-label projection with adaptive noise regularizers and solves the resulting non-convex problem using ADMM, with convergence guarantees and a generalization bound based on Rademacher complexity. Empirical results on UCI benchmarks, CIFAR-10, and CIFAR-10N demonstrate that FLR outperforms state-of-the-art robust learning methods across diverse noise regimes. The work provides a principled, scalable approach for robust learning in real-world noisy environments and suggests avenues for integrating deep representations to model nonlinearity while maintaining data-recovery guarantees.

Abstract

Feature noise and label noise are ubiquitous in practice and pose great challenges for training robust machine learning models. Most previous approaches handle only one of the two. However, in real-world applications, hybrid noise, which contains both feature noise and label noise, is very common because data collection and annotation processes are unreliable. Although a few representation-learning-based attempts have achieved some results, the problem is still far from solved, both in empirical performance and in theoretical guarantees. To address this challenge, we propose a novel unified learning framework called "Feature and Label Recovery" (FLR) that combats hybrid noise from the perspective of data recovery, concurrently reconstructing both the feature matrix and the label matrix of the input data. Specifically, the clean feature matrix is recovered by low-rank approximation, and the ground-truth label matrix is embedded from the recovered features under nuclear norm regularization. Meanwhile, the feature noise and label noise are characterized by their respective adaptive matrix norms so as to satisfy the corresponding maximum likelihood. As this framework leads to a non-convex optimization problem, we develop a non-convex Alternating Direction Method of Multipliers (ADMM) with a convergence guarantee to solve our learning objective. We also provide a theoretical analysis showing that the generalization error of FLR can be upper-bounded in the presence of hybrid noise. Experimental results on several typical benchmark datasets demonstrate the superiority of our proposed method over state-of-the-art robust learning approaches under various types of noise.
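
The abstract relies on nuclear norm regularization inside an ADMM solver. As a rough illustration only, the sketch below shows singular value thresholding, the standard proximal operator of the nuclear norm that ADMM-style low-rank solvers commonly apply in their low-rank update step. It is not taken from the paper: the function name `singular_value_threshold` and the threshold `tau` are hypothetical, and whether FLR uses exactly this update is an assumption.

```python
import numpy as np

def singular_value_threshold(A, tau):
    """Proximal operator of tau * ||.||_* (nuclear norm).

    Hypothetical helper for illustration: shrinks every singular value
    of A by tau and clips at zero, which lowers the rank of the result.
    """
    # Thin SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Soft-threshold the singular values
    s_shrunk = np.maximum(s - tau, 0.0)
    # Rebuild the matrix from the shrunken spectrum
    return (U * s_shrunk) @ Vt

# Toy usage: a diagonal matrix with singular values 3, 1 and 0.5
A = np.diag([3.0, 1.0, 0.5])
A_low_rank = singular_value_threshold(A, tau=1.0)
print(np.linalg.matrix_rank(A_low_rank))
```

Larger values of `tau` remove more singular values, so the threshold trades fidelity to the noisy input against the rank of the recovered matrix.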

Paper Structure (26 sections, 6 theorems, 30 equations, 6 figures, 3 tables)

This paper contains 26 sections, 6 theorems, 30 equations, 6 figures, 3 tables.

Key Result

Theorem 1

Let $\{\Gamma_t = (\bm{X}_t, \bm{Z}_t, \bm{B}_t, \bm{J}_t, \bm{K}_t, \bm{E}_{l,t}, \bm{E}_{f,t}, \{\bm{M}_{i,t}\}_{i=1}^5)\}_{t=1}^\infty$ be the sequence generated by the algorithm for feature and label noise. Assume that $\lim\limits_{t \rightarrow +\infty} \mu_t(\ldots$

Figures (6)

  • Figure 1: The problem setting of our proposed hybrid noise learning. Data with Feature Noise: all labels are correct, yet the features of examples are corrupted. Data with Label Noise: the features of all examples are clean, while some labels are incorrect. We consider a more challenging case, namely Data with Hybrid Noise: both features and labels of training examples are noisy. The noisy label is indicated in red.
  • Figure 2: The proposed unified learning paradigm for the hybrid noise removal. The clean feature matrix $\bm{X}$ and correct label matrix $\bm{Y}$ are recovered from noisy data $\tilde{\bm{X}}$, $\tilde{\bm{Y}}$, respectively. Meanwhile, the true label matrix $\bm{Y}$ is embedded by $\bm{X}$ via a low-rank projection $\bm{Z}$, where the feature error matrix is $\bm{E}_f$ and the label error matrix is $\bm{E}_l$, respectively.
  • Figure 3: The experimental results on CIFAR-10 dataset in various noise scenarios.
  • Figure 4: Ablation of FLR on "Aggre.".
  • Figure 5: Parametric sensitivity of FLR on "Aggre.".
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Lemma 3
  • proof
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • proof