High-dimensional Learning with Noisy Labels
Aymane El Firdoussi, Mohamed El Amine Seddik
TL;DR
This work addresses high-dimensional binary classification with class-conditional label noise, using random matrix theory (RMT) to analyze a Labels-Perturbed Classifier (LPC), a ridge-based method with a parameterized, noise-aware loss. The authors derive deterministic equivalents for the resolvent and show that, in the regime where both the data dimension $p$ and the sample size $n$ are large and comparable, the LPC decision statistic is asymptotically Gaussian with explicitly computable mean $m_{\rho}$ and variance $\nu_{\rho}-m_{\rho}^2$. This characterization enables an optimal choice of the perturbation parameters $\rho_\pm$. They prove that the conventional unbiased classifier can be sub-optimal in high dimensions and provide a closed-form $\rho_+^*$ that maximizes test accuracy, alongside a practical procedure to estimate the noise rates $\varepsilon_\pm$. Empirical validation on real datasets shows that the Optimized LPC consistently improves performance under label noise and closely approaches an oracle trained on correct labels, offering a robust, theoretically grounded approach to high-dimensional learning with noisy labels.
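To make the link between a Gaussian decision statistic and test accuracy concrete, here is a minimal sketch. It rests on an assumption not stated in this summary: the statistic is distributed as $\mathcal{N}(+m, v)$ for class $+1$ and $\mathcal{N}(-m, v)$ for class $-1$, and the classifier thresholds it at zero. The function name `gaussian_accuracy` and its arguments are hypothetical; in the notation above, one would pass $m = m_{\rho}$ and $v = \nu_{\rho} - m_{\rho}^2$.

```python
import math


def gaussian_accuracy(mean: float, variance: float) -> float:
    """Accuracy of sign-thresholding at 0 under a symmetric Gaussian assumption.

    Assumes (hypothetically) that the decision statistic is N(+mean, variance)
    for class +1 and N(-mean, variance) for class -1, so the accuracy is
    Phi(mean / sqrt(variance)), with Phi the standard normal CDF.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = mean / math.sqrt(variance)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Under this assumption, accuracy depends on the parameters only through the ratio of the mean to the standard deviation, which is why tuning $\rho_\pm$ can be read as maximizing that ratio.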
Abstract
This paper provides theoretical insights into high-dimensional binary classification with class-conditional noisy labels. Specifically, we study the behavior of a linear classifier with a label-noise-aware loss function when both the data dimension $p$ and the sample size $n$ are large and comparable. Relying on random matrix theory and assuming a Gaussian mixture data model, we show that the performance of the linear classifier as $p,n\to \infty$ converges to a limit that involves scalar statistics of the data. Importantly, our findings show that low-dimensional intuitions for handling label noise do not hold in high dimensions, in the sense that the classifier that is optimal in low dimensions fails dramatically in high dimensions. Based on our derivations, we design an optimized method that is provably more efficient at handling noisy labels in high dimensions. Our theoretical conclusions are further confirmed by experiments on real datasets, where our optimized approach outperforms the considered baselines.
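The setting described above can be illustrated with a small simulation. The sketch below is not the paper's method: it fits a plain ridge classifier (no noise-aware loss) on a two-class Gaussian mixture with identity covariance, once with clean labels and once with class-conditionally flipped labels. The function name `simulate`, the parameter values, the choice of mean vector, and the ridge normalization are all assumptions made for illustration.

```python
import numpy as np


def simulate(n=400, p=200, mu_norm=2.0, eps_plus=0.3, eps_minus=0.1,
             gamma=1.0, n_test=2000, seed=0):
    """Compare plain ridge classifiers trained on clean vs. noisy labels.

    Data model (an illustrative assumption): x = y * mu + z, with z ~ N(0, I_p)
    and y uniform on {-1, +1}. Labels of class +1 are flipped with probability
    eps_plus, labels of class -1 with probability eps_minus.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = mu_norm  # class mean of norm mu_norm along the first axis

    def sample(m):
        y = rng.choice([-1.0, 1.0], size=m)
        X = y[:, None] * mu + rng.standard_normal((m, p))
        return X, y

    X, y = sample(n)
    X_test, y_test = sample(n_test)

    # Class-conditional label noise: the flip probability depends on the true class.
    flip_prob = np.where(y > 0, eps_plus, eps_minus)
    y_noisy = np.where(rng.random(n) < flip_prob, -y, y)

    def ridge(X_train, labels):
        # w = (X^T X / n + gamma * I)^{-1} X^T labels / n
        gram = X_train.T @ X_train / n + gamma * np.eye(p)
        return np.linalg.solve(gram, X_train.T @ labels / n)

    def accuracy(w):
        return float(np.mean(np.sign(X_test @ w) == y_test))

    return {"clean": accuracy(ridge(X, y)), "noisy": accuracy(ridge(X, y_noisy))}
```

Calling `simulate()` with different ratios `p / n` and noise rates gives a rough empirical feel for how label noise interacts with dimensionality. The clean-label fit plays the role of an oracle baseline in this toy setup.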
