Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise

Yihao Xue; Kyle Whitecross; Baharan Mirzasoleiman

TL;DR

The study uncovers a final ascent in generalization loss induced by label noise as model width grows, challenging the notion that larger models always improve performance. Using a tractable theory of random-feature ridge regression, it attributes this ascent to a label-noise-induced increase in test loss variance, with a closed-form expression for the noise variance term that scales with the noise-to-sample-size ratio $\kappa$. Extending the analysis to model density reveals that intermediate density (i.e., wider but sparser networks) can yield better generalization under label noise, and that changing density can behave differently from plain $\ell_2$ regularization. Empirical results on MNIST, CIFAR-10/100, and Stanford Cars across various architectures and robust training methods corroborate the theory: the final ascent is common under label noise, can be amplified by regularization or robustness techniques, and is mitigated by a larger sample size. The findings have practical implications for designing architectures when labels are noisy and highlight the potential of wider but sparser models in noisy regimes.

Abstract

Increasing the size of overparameterized neural networks has been key to achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes decreases monotonically) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a \textit{final ascent} in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger $\ell_2$ regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLU networks trained on MNIST, ResNets/ViTs trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels.
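To make "reducing density by randomly dropping trainable parameters" concrete, here is a minimal NumPy sketch. It is a generic illustration rather than the paper's construction: the function name `train_masked_linear`, the plain gradient-descent loop, the linear model, and all hyperparameters are assumptions chosen for brevity. A fixed random binary mask decides which weights are trainable; the dropped weights stay at zero throughout training.

```python
import numpy as np

def train_masked_linear(X, y, density, steps=500, lr=0.05, seed=0):
    """Fit a linear model by gradient descent, keeping only a random
    fraction `density` of the weights trainable (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # 1.0 marks a trainable weight, 0.0 marks a dropped (frozen-at-zero) weight.
    mask = (rng.random(d) < density).astype(float)
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean squared error
        w -= lr * grad * mask              # dropped coordinates never receive updates
    return w, mask

# Usage: noisy labels from an assumed linear ground truth.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X @ np.ones(50) + 0.5 * rng.standard_normal(200)
w, mask = train_masked_linear(X, y, density=0.5)
print(f"fraction of trainable weights: {mask.mean():.2f}")
```

Sweeping `density` at a fixed width, and width at a fixed density, is one way to probe the trade-off the abstract describes; this sketch makes no claim about what such a sweep shows.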

Paper Structure (37 sections, 4 theorems, 22 equations, 34 figures, 1 table)

Introduction
Related Work
Theoretical Analysis of Random Feature Ridge Regression
Effect of Width: the Final Ascent
Beyond Width: Effect of Density
Advantage of Wider but Sparser Models
Experiments
Final Ascent in Random Feature Ridge Regression with Different $\frac{n}{d}$'s
Final Ascent in NNs: Effect of Width
Final Ascent in NNs: Effect of Density
Advantage of Wider but Sparser Models
Final Ascent in Neural Networks: Robust Algorithms
Conclusion and Discussion
Theoretical Results
Bias-Variance Decomposition of the MSE Loss in Section \ref{sec:theory_width}
...and 22 more sections

Key Result

Theorem 3.1

For a 2-layer linear network with $p$ hidden neurons and a random first layer, consider learning the second layer by ridge regression with regularizer $\lambda$ on $n$ training examples with feature dimension $d$, and label noise with variance $\sigma^2$. Let $\lambda = \frac{n}{d}\lambda_0$ and $\sig
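The setting of Theorem 3.1 can be explored numerically. The sketch below is a rough simulation under stated assumptions, not the paper's experimental code: Gaussian inputs, a linear ground-truth target `beta`, the specific scalings of `W` and `beta`, $n = 200$, and the width grid are all hypothetical choices. Only $d = 100$ and $\lambda = 0.2$ are borrowed from the Figure 5 caption, and $\lambda$ is used directly rather than through the $\frac{n}{d}\lambda_0$ scaling.

```python
import numpy as np

def rf_ridge_test_loss(p, n=200, d=100, sigma2=0.5, lam=0.2, n_test=2000, seed=0):
    """Test MSE of ridge regression on p random linear features.

    The first layer W is random and fixed; only the second layer is fit.
    Training labels carry Gaussian noise of variance sigma2; test labels are clean.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)    # assumed linear ground truth
    W = rng.standard_normal((d, p)) / np.sqrt(d)  # random first layer (not trained)
    X = rng.standard_normal((n, d))
    X_test = rng.standard_normal((n_test, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    y_test = X_test @ beta
    Z, Z_test = X @ W, X_test @ W                 # hidden-layer features
    # Ridge solution for the second layer: (Z^T Z + lam I)^{-1} Z^T y
    a = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    return float(np.mean((Z_test @ a - y_test) ** 2))

widths = [10, 50, 100, 200, 400, 800]
# Average over a few seeds to reduce the noise in each estimate.
losses = {p: np.mean([rf_ridge_test_loss(p, seed=s) for s in range(5)]) for p in widths}
for p, loss in losses.items():
    print(f"p={p:4d}  test MSE={loss:.4f}")
```

Rerunning the sweep with different values of `sigma2` and `n` is a simple way to check how the shape of the loss-versus-width curve depends on the noise-to-sample-size ratio.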

Figures (34)

Figure 1: Decomposition of test loss. $\textbf{Risk}= \textbf{Bias}^2 +\textbf{Variance}$. $\textbf{Bias}^2$ always monotonically decreases. $\textbf{Variance}$ exhibits a transition from a unimodal shape to an increasing-decreasing-increasing pattern as noise increases, leading to the final ascent in test loss.
Figure 2: Decomposition of variance. $\textbf{Variance} = \textbf{Variance}_{\text{clean}} + \textbf{Variance}_{\text{noise}}$. $\textbf{Variance}_{\text{clean}}$ is always unimodal. $\textbf{Variance}_{\text{noise}}$ monotonically increases with width, and its scale grows with noise level, leading to the increasing-decreasing-increasing pattern of $\textbf{Variance}$ at sufficient noise.
Figure 3: With stronger regularization, the optimal width increases and achieves lower loss, making the final ascent more pronounced.
Figure 4: (a), (b): The risk curve changes from decreasing to U-shaped as the noise-to-sample-size ratio ($\kappa$) increases, for different values of width ($p$). (c) The total variance changes from unimodal to increasing as $\kappa$ increases. (d) Under lower density, the optimal width tends to be larger, and achieves lower test loss compared to the optimal width at higher density.
Figure 5: Final ascent in random feature ridge regression with different $\frac{n}{d}$ ratios. We plot the test loss while fixing $d=100$ and $\lambda=0.2$, and varying $\sigma^2$. Legends show the values of $\sigma^2$, and titles show the values of $\frac{n}{d}$.
...and 29 more figures
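
Putting the captions of Figures 1 and 2 together gives a three-term decomposition of the test loss. The block below only restates those two captions in one place; it adds no closed-form expressions.

```latex
\begin{align*}
\textbf{Risk} &= \textbf{Bias}^2 + \textbf{Variance}, \\
\textbf{Variance} &= \textbf{Variance}_{\text{clean}} + \textbf{Variance}_{\text{noise}}, \\
\Rightarrow\quad \textbf{Risk} &= \textbf{Bias}^2 + \textbf{Variance}_{\text{clean}} + \textbf{Variance}_{\text{noise}}.
\end{align*}
```

Read against the captions: $\textbf{Bias}^2$ decreases monotonically with width, $\textbf{Variance}_{\text{clean}}$ is unimodal, and $\textbf{Variance}_{\text{noise}}$ increases with width at a scale set by the noise level. At sufficient noise, the last term is what produces the final ascent at large widths.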

Theorems & Definitions (6)

Theorem 3.1
Theorem 3.2
Lemma A.1
proof
Corollary A.2
proof
