Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise
Yihao Xue, Kyle Whitecross, Baharan Mirzasoleiman
TL;DR
The study uncovers a final ascent in generalization loss induced by label noise as model width grows, challenging the notion that larger models always improve performance. Using a tractable random-feature ridge regression theory, it attributes this ascent to a label-noise-induced increase in test loss variance, with a closed-form expression for the noise variance term that scales with the noise-to-sample-size ratio $\kappa$. Extending the analysis to model density reveals that intermediate density (i.e., wider but sparser networks) can yield better generalization under label noise, and that changing density can behave differently from plain $\ell_2$ regularization. Empirical results on MNIST, CIFAR-10/100, and Stanford Cars across various architectures and robust training methods corroborate the theory, showing that the final ascent is common under noise and can be amplified by regularization or robustness techniques, while a larger sample size mitigates it. The findings have practical implications for designing architectures under noisy labels and highlight the potential of wider, sparser models in noisy regimes.
Abstract
Increasing the size of overparameterized neural networks has been key to achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes decreases monotonically) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a \textit{final ascent} in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger $\ell_2$ regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLU networks trained on MNIST, ResNets/ViTs trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels.
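To make the setting concrete, the following is a minimal sketch of a random-feature ridge regression experiment in which test loss is measured as a function of width under label noise. It is not the authors' code or exact setup: the linear teacher, Gaussian inputs, additive Gaussian label noise, ReLU random features, the default sizes, and the names `rf_ridge_test_loss` and `sweep_widths` are all assumptions chosen for illustration.

```python
import numpy as np


def rf_ridge_test_loss(width, noise_std, n_train=100, n_test=500, d=20,
                       lam=1e-3, seed=0):
    """Test MSE of ridge regression on random ReLU features.

    Hypothetical setup (not from the paper): a linear teacher, Gaussian
    inputs, and additive Gaussian noise on the training labels only.
    """
    rng = np.random.default_rng(seed)

    # Linear teacher and Gaussian inputs (illustrative assumption).
    beta = rng.standard_normal(d) / np.sqrt(d)
    X_train = rng.standard_normal((n_train, d))
    X_test = rng.standard_normal((n_test, d))

    # Noise corrupts the training labels; test labels stay clean.
    y_train = X_train @ beta + noise_std * rng.standard_normal(n_train)
    y_test = X_test @ beta

    # Fixed random first layer; only the second layer is trained.
    W = rng.standard_normal((d, width)) / np.sqrt(d)
    F_train = np.maximum(X_train @ W, 0.0)
    F_test = np.maximum(X_test @ W, 0.0)

    # Closed-form ridge solution: (F^T F + lam * I)^{-1} F^T y.
    A = F_train.T @ F_train + lam * np.eye(width)
    a = np.linalg.solve(A, F_train.T @ y_train)

    return float(np.mean((F_test @ a - y_test) ** 2))


def sweep_widths(widths, noise_std, **kwargs):
    """Map each width to its test loss at a fixed label-noise level."""
    return {w: rf_ridge_test_loss(w, noise_std, **kwargs) for w in widths}


if __name__ == "__main__":
    for noise_std in (0.0, 1.0):
        curve = sweep_widths([10, 50, 100, 200, 800], noise_std)
        print(noise_std, {w: round(v, 4) for w, v in curve.items()})
```

Comparing the curve for `noise_std=0.0` with a noisy one, and varying `n_train` or `lam`, is one way to probe how the noise-to-sample-size ratio and $\ell_2$ regularization change the shape of the curve. A single seed gives a noisy estimate, so averaging over several seeds would be needed before reading anything into the shape.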
