The Effect of Label Noise on the Information Content of Neural Representations

Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, Alessandro Laio

TL;DR

This work investigates how label noise affects the information content of neural representations using the Information Imbalance (II), an asymmetric, scalable proxy related to restricted mutual information. Using fully connected networks (FCNNs) on MNIST and CNNs on CIFAR-10, the authors show a double-descent pattern in II as model width increases and find that label noise can enhance the informativeness of hidden representations in the underparameterized regime, while overparameterized networks become robust to noise. They analytically bound II in Gaussian settings and demonstrate that noise degrades information transfer primarily in the last layer, with hidden representations remaining comparatively stable; random-label training yields representations that are not equivalent to random features. Overall, II offers a task-agnostic lens for diagnosing generalization in neural networks and highlights regime-dependent effects of label noise on learned representations.

Abstract

In supervised classification tasks, models are trained to predict a label for each data point. In real-world datasets, these labels are often noisy due to annotation errors. While the impact of label noise on the performance of deep learning models has been widely studied, its effects on the networks' hidden representations remain poorly understood. We address this gap by systematically comparing hidden representations using the Information Imbalance, a computationally efficient proxy of conditional mutual information. Through this analysis, we observe that the information content of the hidden representations follows a double descent as a function of the number of network parameters, akin to the behavior of the test error. We further demonstrate that in the underparameterized regime, representations learned with noisy labels are more informative than those learned with clean labels, while in the overparameterized regime, these representations are equally informative. Our results indicate that the representations of overparameterized networks are robust to label noise. We also find that the Information Imbalance between the penultimate and pre-softmax layers decreases with the cross-entropy loss in the overparameterized regime. This offers a new perspective on understanding generalization in classification tasks. Extending our analysis to representations learned from random labels, we show that these perform worse than random features. This indicates that training on random labels drives networks well beyond lazy learning, as weights adapt to encode label information.
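
The abstract does not spell out how the Information Imbalance is computed. A minimal sketch follows, assuming the nearest-neighbor rank form commonly used in the Information Imbalance literature: $\Delta(A \rightarrow B)$ is $2/N$ times the average rank, in space $B$, of each point's nearest neighbor in space $A$. The function name `information_imbalance`, the use of Euclidean distances, and the dense pairwise-distance computation are illustrative choices, not details taken from the paper.

```python
import numpy as np


def _pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x, with self-distances set to inf."""
    sq_norms = (x ** 2).sum(axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    return d


def information_imbalance(a, b):
    """Estimate Delta(A -> B) for two representations of the same N samples.

    a, b: arrays of shape (N, d_a) and (N, d_b); row i of each describes sample i.
    Values near 0 mean neighbors in A are also neighbors in B (A predicts B);
    values near 1 mean A carries essentially no information about B.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.shape[0]
    dist_a = _pairwise_sq_dists(a)
    dist_b = _pairwise_sq_dists(b)
    nn_in_a = dist_a.argmin(axis=1)  # index of each point's nearest neighbor in A
    # ranks_b[i, j] = rank of point j among the neighbors of i in B (1 = nearest)
    ranks_b = dist_b.argsort(axis=1).argsort(axis=1) + 1
    rank_of_nn = ranks_b[np.arange(n), nn_in_a]
    return 2.0 * rank_of_nn.mean() / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=(500, 2))  # a 2D representation
    partial = full[:, :1]             # keeps only the first coordinate
    print("full -> partial:", information_imbalance(full, partial))
    print("partial -> full:", information_imbalance(partial, full))
```

The measure is asymmetric by construction: in the demo, the two-dimensional representation should predict its own first coordinate better than the reverse, which is the kind of comparison the paper applies to pairs of network layers. The dense distance matrices cost $O(N^2)$ memory, which is workable for sample counts on the order of the $2\times10^3$ test samples mentioned in the figure captions.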

Paper Structure

This paper contains 21 sections, 52 equations, 5 figures.

Figures (5)

  • Figure 1: Left: Information Imbalance (II) and its lower bound between the representations $x$ and $y$ of a Gaussian random variable $z$, as a function of the correlation parameter $\rho$. Right: II and its lower bound for predicting the observed variable $\mathbf{y}$ given the ground-truth signal $\mathbf{x}$ of a Gaussian denoising model, as a function of the signal-to-noise ratio $\lambda$; the signal dimension is $100$. The inset shows the difference between the II and its lower bound. Each II curve is the average of $30$ experiments computed on $10^3$ samples.
  • Figure 2: Double descent phenomenon in the relative information content of statistically independent representations. The top panels show results for single-layer FCNNs trained on the MNIST dataset, and the bottom panels show results for CNNs trained on the CIFAR-10 dataset. All curves are plotted as a function of the number of network parameters. Panels (a) and (d) show the generalization error for different levels of label noise. Panels (b) and (e) show the Information Imbalance between the hidden representations of identical networks trained on the same dataset with independent initializations. Panels (c) and (f) show the Information Imbalance between the pre-softmax representations (the logits that provide the network output) of these networks. The Information Imbalance was computed using $2\times10^3$ test samples, with each curve representing an average over $90$ pairs of representations.
  • Figure 3: Information loss between the penultimate and last-layer pre-softmax representations of overparameterized networks. The top panels display results for single-layer FCNNs trained on the MNIST dataset, while the bottom panels show results for CNNs trained on the CIFAR-10 dataset. All curves are plotted as a function of the number of network parameters. Panels (a) and (d) present the Information Imbalance (II) curves between hidden representations of identical networks trained with and without label noise. Panels (b) and (e) show the difference between the two IIs in panels (a) and (d), respectively, i.e., $\Delta(\cdot\% \rightarrow 0\%) - \Delta(0\% \rightarrow \cdot\%)$. Panels (c) and (f) show the II for predicting the last-layer preactivation representation given the hidden (penultimate) representation of a network trained with a given label noise ratio. For FCNNs (respectively, CNNs), the curves are averaged over $20$ (respectively, $10$) independent networks trained with different initializations.
  • Figure 4: Performance of a trained network against the Information Imbalance of its hidden representation predicting its pre-softmax representation. Left: Results for a single-layer FCNN with 100 hidden units trained on the MNIST classification task. Right: Results for a vanilla CNN with a width parameter of 100 trained on the CIFAR-10 classification task. Each point is the average over $20$ networks trained with independent initializations.
  • Figure 5: Effect of label memorization on representations. The first row shows the evolution of the Information Imbalances between two representations as the number of network parameters $(P)$ and the training sample size $(N)$ are scaled proportionally by a factor $c$, while their ratio $\alpha = P/N$ remains fixed. Panels (a), (b), and (c) correspond to $\alpha \approx 5$, $\alpha \approx 20$, and $\alpha \approx 30$, respectively. The blue curves show results between random features (RF) and representations learned with clean labels (CLEAN), while the orange curves show results between RF and representations learned with random labels (NOISE). Initially, $N_0 = 4000$ and $P_0 = \alpha N_0$ (smallest circle, $c=1$); both $N$ and $P$ are then scaled up proportionally to the largest circle ($c=10$), where $N = 10 N_0$ and $P = 10 \alpha N_0$. The second row displays the test accuracy of a classifier trained on a clean task using the RF, CLEAN, and NOISE representations. Panels (e), (f), and (h) correspond to the representations in Panels (a), (b), and (c), respectively. All curves are averages over 20 independent experiments on the MNIST dataset.
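
The captions above refer to networks trained at different label noise levels, but this summary does not state how the noisy labels are generated. A minimal sketch under one common assumption, symmetric noise in which a fixed fraction of training labels is replaced by a different class chosen uniformly at random, is shown here. The function `corrupt_labels` and its parameters are hypothetical names for illustration; the paper's actual noise model may differ (for instance, it may resample over all classes including the original one).

```python
import numpy as np


def corrupt_labels(labels, noise_ratio, num_classes, seed=0):
    """Return a copy of `labels` with a fraction `noise_ratio` flipped to another class.

    Assumed noise model (not confirmed by the source): symmetric label noise,
    where each selected label is replaced by a uniformly chosen *different* class.
    Also returns the indices of the corrupted samples.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n = noisy.shape[0]
    n_flip = int(round(noise_ratio * n))
    flip_idx = rng.choice(n, size=n_flip, replace=False)
    # An offset in 1..num_classes-1 (mod num_classes) guarantees a changed label.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (noisy[flip_idx] + offsets) % num_classes
    return noisy, flip_idx


if __name__ == "__main__":
    clean = np.arange(1000) % 10  # stand-in for 10-class labels such as MNIST digits
    noisy, flipped = corrupt_labels(clean, noise_ratio=0.2, num_classes=10)
    print("fraction changed:", (noisy != clean).mean())
```

Setting `noise_ratio=1.0` under this scheme changes every label, which is not the same as assigning fully random labels; a random-label experiment like the one in Figure 5 would more naturally draw each label uniformly over all classes.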