Just How Flexible are Neural Networks in Practice?
Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson
TL;DR
This work interrogates how flexible neural networks are in practice by measuring Effective Model Complexity ($EMC$), the largest sample size a model can fit under a realistic training procedure. It shows that standard optimizers often find minima that fit far fewer samples than the model has parameters, that CNNs are more parameter-efficient than MLPs and ViTs even on randomly labeled data, and that SGD finds minima that fit more data than full-batch gradient descent, challenging the common view that stochastic training acts as a regularizer. The study also finds that the gap between a model's capacity to fit correctly labeled and randomly labeled data can be predictive of generalization, and that ReLU activations lead to minima that fit more data, even though they were designed to avoid vanishing and exploding gradients in deep architectures. Additionally, reparameterization techniques such as subspace training and low-precision representations can substantially improve parameter efficiency, suggesting that neural networks are often wasteful with parameters and that careful reparameterization can improve practical performance without sacrificing expressiveness.
Abstract
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
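To make the notion of "the largest training set a model can fit" concrete, the following is a minimal sketch of a sample-size scan, not the authors' exact protocol. The names `estimate_emc` and `make_linear_fit_check` are hypothetical, the linear scan with a fixed step and the exact-fit tolerance are assumptions, and a least-squares linear model on random targets stands in for a trained network.

```python
import numpy as np


def estimate_emc(fits, n_start=1, n_max=1024, step=1):
    """Scan sample sizes upward and return the largest n that was fit.

    `fits(n)` is a user-supplied callable (an assumption of this sketch)
    that trains a fresh model on n samples and reports whether the
    training set was fit. The scan stops at the first failure, which
    assumes that fitting only gets harder as n grows.
    Returns 0 if even n_start cannot be fit.
    """
    largest = 0
    n = n_start
    while n <= n_max:
        if not fits(n):
            break
        largest = n
        n += step
    return largest


def make_linear_fit_check(num_params, seed=0, tol=1e-6):
    """Build a toy `fits` check: a linear model with `num_params` weights,
    solved by least squares, on Gaussian inputs with random targets."""
    rng = np.random.default_rng(seed)

    def fits(n):
        X = rng.standard_normal((n, num_params))
        y = rng.standard_normal(n)  # random targets, analogous to random labels
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        # "Fit" here means every residual is below the tolerance.
        return bool(np.max(np.abs(X @ w - y)) < tol)

    return fits


emc = estimate_emc(make_linear_fit_check(num_params=8), n_max=64)
print(f"toy linear model with 8 parameters: estimated EMC = {emc}")
```

In this toy setting the closed-form solver interpolates generic data up to as many samples as there are weights, which mirrors the belief stated at the start of the abstract. For a neural network, `fits` would instead wrap a full training run with a chosen optimizer and architecture, and the resulting estimate could fall well below the parameter count, which is the gap the paper studies.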
