High-dimensional Learning with Noisy Labels

Aymane El Firdoussi, Mohamed El Amine Seddik

TL;DR

This work addresses high-dimensional binary classification with class-conditional label noise by leveraging random matrix theory (RMT) to analyze a Labels-Perturbed Classifier (LPC), a ridge-based method with a parameterized, noise-aware loss. The authors derive deterministic equivalents for the resolvent and show that, in the regime where both data dimension $p$ and sample size $n$ are large and comparable, the LPC decision statistic is asymptotically Gaussian with explicitly computable mean $m_{\rho}$ and variance $\nu_{\rho}-m_{\rho}^2$, enabling an optimal choice of the perturbation parameters $\rho_\pm$. They prove that the conventional unbiased classifier can be sub-optimal in high dimensions and provide a closed-form $\rho_+^*$ that maximizes test accuracy, alongside a practical procedure to estimate noise rates $\varepsilon_\pm$. Empirical validation on real datasets shows the Optimized LPC consistently improves performance under label noise, closely approaching an oracle trained on correct labels, thereby offering a robust, theoretically grounded approach for high-dimensional noisy-label learning.

Abstract

This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label-noise-aware loss function when both the data dimension $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory and assuming a Gaussian mixture data model, we show that the performance of the linear classifier converges, as $p,n\to \infty$, to a limit involving scalar statistics of the data. Importantly, our findings show that low-dimensional intuitions for handling label noise do not hold in high dimensions, in the sense that the classifier that is optimal in low dimensions fails dramatically in high dimensions. Based on our derivations, we design an optimized method that is provably more efficient at handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where our optimized approach outperforms the considered baselines.
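To make the setting concrete, the following is a minimal simulation sketch, not the authors' LPC. It assumes an isotropic two-class Gaussian mixture $x = y\,\mu + z$ with $z \sim \mathcal{N}(0, I_p)$, class-conditional label flips with rates $\varepsilon_+$ and $\varepsilon_-$, and a plain ridge classifier trained once on clean labels and once on noisy labels. All function names and the parameter `pos_fraction` are hypothetical and chosen for illustration only.

```python
import numpy as np


def sample_mixture(n, p, mu, pos_fraction, rng):
    """Draw n points x = y * mu + z with z ~ N(0, I_p) and y in {-1, +1}."""
    y = np.where(rng.random(n) < pos_fraction, 1.0, -1.0)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n, p))
    return x, y


def flip_labels(y, eps_plus, eps_minus, rng):
    """Flip +1 labels with probability eps_plus and -1 labels with eps_minus."""
    flip_prob = np.where(y > 0, eps_plus, eps_minus)
    flips = rng.random(y.shape[0]) < flip_prob
    return np.where(flips, -y, y)


def ridge_classifier(x, y, gamma):
    """Solve (X^T X / n + gamma I) w = X^T y / n for the weight vector w."""
    n, p = x.shape
    return np.linalg.solve(x.T @ x / n + gamma * np.eye(p), x.T @ y / n)


def accuracy(w, x, y):
    """Fraction of points whose sign(w^T x) matches the true label."""
    return float(np.mean(np.sign(x @ w) == y))


def run_experiment(n=2000, p=20, mu_norm=2.0, pos_fraction=0.5,
                   eps_plus=0.4, eps_minus=0.3, gamma=0.1, seed=0):
    """Return (clean-label accuracy, noisy-label accuracy) on clean test data."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = mu_norm  # assumed mean direction; only its norm matters here
    x_train, y_train = sample_mixture(n, p, mu, pos_fraction, rng)
    y_noisy = flip_labels(y_train, eps_plus, eps_minus, rng)
    x_test, y_test = sample_mixture(5000, p, mu, pos_fraction, rng)
    w_clean = ridge_classifier(x_train, y_train, gamma)
    w_noisy = ridge_classifier(x_train, y_noisy, gamma)
    return accuracy(w_clean, x_test, y_test), accuracy(w_noisy, x_test, y_test)


if __name__ == "__main__":
    for p in (20, 200, 1000):
        print(p, run_experiment(p=p))
```

Running the script for increasing `p` at fixed `n` gives a rough feel for how the gap between clean-label and noisy-label training changes as the ratio $p/n$ grows.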

Paper Structure

This paper contains 35 sections, 9 theorems, 96 equations, 9 figures, 1 table.

Key Result

Lemma 3.4

In the high-dimensional regime where $p,n\to \infty$ with $\frac{p}{n} \to \eta \in (0, \infty)$, and assuming $\Vert {\bm{\mu}} \Vert = {\mathcal{O}}(1)$, a deterministic equivalent for ${\mathbf{Q}}\equiv{\mathbf{Q}}(\gamma)$, as defined in w_imp, is given by:

Figures (9)

  • Figure 1: Distribution of the decision function ${\bm{w}}_\rho^\top {\bm{x}}$ of different variants of LPC for $n = 5000$, $\pi_1 = \frac{1}{3}$, $\varepsilon_+ = 0.4$, $\varepsilon_- = 0.3$, $\Vert {\bm{\mu}} \Vert = 2$, $\gamma = 0.1$, $p = 50$ (first row) and $p = 1000$ (second row). The theoretical Gaussian distributions are predicted as per Theorem 4.2. Note that the variance of the decision function for the unbiased classifier increases with the dimension, yielding poor accuracy.
  • Figure 2: Test performance (accuracy and risk) of different LPC variants as a function of the positive noise rate $\varepsilon_+$. We considered $n = 100$, $\pi_1 = \frac{1}{3}$, $\varepsilon_- = 0.2$, $\Vert {\bm{\mu}} \Vert = 2$, $\gamma = 10$, $\rho_+ = 0.2$ and $\rho_- = 0$ (for LPC in blue). The theoretical curves are obtained as per Proposition 4.3. We notice that the effect of label noise is more pronounced in high dimensions, i.e., for large values of $\eta$.
  • Figure 3: Test accuracy of LPC when fixing $\rho_-=0$ and varying $\rho_+$. We considered $n = 1000$, $\pi_1 = 0.3$, $\Vert {\bm{\mu}} \Vert = 2$, $\varepsilon_+ = 0.4$, $\varepsilon_- = 0.3$ and optimal $\gamma$. We notice that the test accuracy is maximized at $\rho_+^*$, yielding better accuracy than the unbiased approach. Note that for small values of $\eta$, i.e., in low dimensions, the test accuracy becomes flat as a function of $\rho_+$, and in the limit $\eta\to 0$ the maximizer $\rho_+^*$ is not identifiable, as discussed in Remark \ref{remark_infinie_samples}.
  • Figure 4: Histogram of the decision function of different LPC variants on the books dataset (Blitzer et al., 2007), along with the theoretical distribution as predicted by Theorem 4.2. We considered $n = 1600$, $p = 400$, $\pi_1 = 0.3$, $\varepsilon_+ = 0.4$, $\varepsilon_- = 0.3$ and optimal $\gamma$.
  • Figure 5: Empirical versus theoretical test accuracy as per Proposition \ref{prop:test-accuracy} for different variants of LPC. We used $(n, p) = (2000, 20)$ for the low-dimensional plot and $(n, p) = (200, 200)$ for the high-dimensional experiment, with $\pi_1 = 0.3$, $\varepsilon_+ = 0.4$, $\varepsilon_- = 0.3$ and varying $\gamma$.
  • ...and 4 more figures
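The captions above compare empirical histograms and accuracies with Gaussian predictions. As a hedged illustration of how a Gaussian decision statistic translates into a test accuracy, the sketch below assumes that the decision function is distributed as $\mathcal{N}(+m, \nu - m^2)$ on the positive class and $\mathcal{N}(-m, \nu - m^2)$ on the negative class, and that the decision threshold is zero. Under these assumptions (which are ours, not a restatement of the paper's proposition), the accuracy is $\Phi\big(m / \sqrt{\nu - m^2}\big)$ regardless of the class proportions. The function names are hypothetical.

```python
import math


def standard_normal_cdf(t):
    """Phi(t), the CDF of a standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gaussian_accuracy(m, nu):
    """Accuracy of sign(g) when g ~ N(+m, nu - m^2) or N(-m, nu - m^2).

    m  : assumed mean of the decision statistic on the positive class
    nu : assumed second moment, so that the variance is nu - m^2
    """
    variance = nu - m * m
    if variance <= 0.0:
        raise ValueError("nu must exceed m**2 so that the variance is positive")
    return standard_normal_cdf(m / math.sqrt(variance))


if __name__ == "__main__":
    # A larger mean-to-standard-deviation ratio gives a higher accuracy.
    for m, nu in [(0.0, 1.0), (0.5, 1.0), (1.0, 2.0)]:
        print(m, nu, gaussian_accuracy(m, nu))
```

This makes explicit why an inflated variance of the decision function, as noted for the unbiased classifier in Figure 1, lowers accuracy even when the mean is unchanged.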

Theorems & Definitions (17)

  • Remark 3.1: On the data model
  • Definition 3.2: Resolvent
  • Definition 3.3: Deterministic equivalent (Hachem et al., 2007)
  • Lemma 3.4: Deterministic equivalent of the resolvent
  • Theorem 4.2: Gaussianity of LPC
  • Proposition 4.3: Asymptotic test accuracy & risk of LPC
  • Remark 4.4: On the relevance of the RMT analysis
  • Lemma A.1: Resolvent identity
  • Lemma A.2: Sherman-Morrison
  • Lemma A.3: Relevant Identities
  • ...and 7 more
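The Sherman-Morrison identity listed above states that $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}$. A small numerical check may help show how it typically interacts with a resolvent: it relates the resolvent built from all samples to the one with a single sample left out. The sketch below assumes the resolvent has the form $Q(\gamma) = (X^\top X / n + \gamma I_p)^{-1}$ with samples stored as rows of $X$; this form and the function names are assumptions for illustration and may differ from the paper's exact definition.

```python
import numpy as np


def resolvent(x, gamma):
    """Assumed resolvent Q(gamma) = (X^T X / n + gamma I_p)^{-1}, X of shape (n, p)."""
    n, p = x.shape
    return np.linalg.inv(x.T @ x / n + gamma * np.eye(p))


def leave_one_out_resolvent(x, gamma, i):
    """Resolvent with sample i removed from the sum, still normalized by n."""
    n, p = x.shape
    gram = x.T @ x - np.outer(x[i], x[i])
    return np.linalg.inv(gram / n + gamma * np.eye(p))


def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using the rank-one update formula."""
    a_inv_u = a_inv @ u
    v_a_inv = v @ a_inv
    return a_inv - np.outer(a_inv_u, v_a_inv) / (1.0 + v @ a_inv_u)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, gamma, i = 50, 10, 0.1, 3
    x = rng.standard_normal((n, p))
    q_full = resolvent(x, gamma)
    q_minus = leave_one_out_resolvent(x, gamma, i)
    # Adding back the rank-one term x_i x_i^T / n recovers the full resolvent.
    q_updated = sherman_morrison(q_minus, x[i] / n, x[i])
    print(np.max(np.abs(q_updated - q_full)))
```

The printed value should be at the level of floating-point round-off, confirming that the rank-one update and the direct inverse agree.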