Feature-Learning Networks Are Consistent Across Widths At Realistic Scales
Nikhil Vyas, Alexander Atanasov, Blake Bordelon, Depen Morwani, Sabarish Sainathan, Cengiz Pehlevan
TL;DR
The paper investigates how network width affects training dynamics under the $\mu$P parameterization, testing vision and language models to determine whether networks of realistic width can be described by an infinite-width feature-learning limit. In online training it demonstrates strong consistency across widths in losses, predictions, representations, and dynamical phenomena; convergence occurs at widths within practical ranges, with the required width depending on task complexity. When training becomes offline or tasks are harder, finite-width deviations arise from two sources: initialization-induced variance and a bias of narrower widths. Ensembling reduces the variance but does not fully recover the infinite-width behavior. A spectral analysis suggests the finite-width bias stems from deformation of the eigenfunctions of the ensemble NTK, offering a mechanistic explanation that aligns with after-kernel observations on CIFAR-5m. Overall, the work argues that infinite-width feature-learning models provide a robust framework for understanding realistic networks, while highlighting task-dependent finite-width corrections and the need to consider spectral dynamics in their analysis. The authors also plan to release code to enable reproducibility.
Abstract
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training. For simple tasks such as CIFAR-5m this holds throughout training for networks of realistic widths. We also show that structural properties of the models, including internal representations, preactivation distributions, edge of stability phenomena, and large learning rate effects, are consistent across large widths. This motivates the hypothesis that phenomena seen in realistic models can be captured by infinite-width, feature-learning limits. For harder tasks (such as ImageNet and language modeling), and later training times, finite-width deviations grow systematically. Two distinct effects cause these deviations across widths. First, the network output has initialization-dependent variance scaling inversely with width, which can be removed by ensembling networks. Second, we observe that ensembles of narrower networks still perform worse than a single wide network; we call this the bias of narrower width. We conclude with a spectral perspective on the origin of this finite-width bias.
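The two finite-width effects named in the abstract can be separated numerically. The following is a minimal sketch on synthetic data, not the authors' code: the function name `bias_variance_split`, the toy model for the predictions, and the choice of a bias that shrinks as 1/width are all illustrative assumptions. Only the 1/width scaling of the variance is taken from the abstract. In practice, `preds` would hold the outputs of same-width networks trained from different initializations, and `reference` the outputs of a much wider network standing in for the infinite-width limit.

```python
import numpy as np

def bias_variance_split(preds, reference):
    """Split the mean squared deviation from a reference into variance and bias.

    preds:     (n_seeds, n_points) outputs of same-width networks, one row per seed.
    reference: (n_points,) outputs of a wide reference network (assumed stand-in
               for the infinite-width limit).
    Returns (total, variance, bias_sq) with total == variance + bias_sq.
    """
    ensemble_mean = preds.mean(axis=0)                    # ensembling averages over seeds
    variance = ((preds - ensemble_mean) ** 2).mean()      # removed by ensembling
    bias_sq = ((ensemble_mean - reference) ** 2).mean()   # survives ensembling
    total = ((preds - reference) ** 2).mean()             # error of a single network
    return total, variance, bias_sq

# Synthetic toy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n_seeds, n_points = 32, 500
reference = rng.normal(size=n_points)
bias_direction = rng.normal(size=n_points)

results = {}
for width in (64, 256, 1024):
    # Seed-dependent noise with variance proportional to 1/width (as in the abstract).
    noise = rng.normal(size=(n_seeds, n_points)) / np.sqrt(width)
    # Seed-independent shift; the 1/width decay is an arbitrary toy choice.
    preds = reference + bias_direction / width + noise
    results[width] = bias_variance_split(preds, reference)
    total, variance, bias_sq = results[width]
    print(f"width={width:5d} total={total:.2e} variance={variance:.2e} bias^2={bias_sq:.2e}")
```

The decomposition is exact because the cross term vanishes when the variance is measured around the ensemble mean. With a finite ensemble, the estimated `bias_sq` also contains a residual of roughly `variance / n_seeds`, so a larger ensemble gives a cleaner estimate of the bias term.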
