Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson

TL;DR

This work examines how flexible neural networks are in practice by measuring the Effective Model Complexity ($EMC$), the largest sample size a model can fit under realistic training procedures. It shows that standard optimizers often locate minima that fit far fewer samples than the parameter count would allow, that CNNs are more parameter-efficient than MLPs and ViTs even on randomly labeled data, and that SGD can fit more data than full-batch GD, challenging the common assumption that stochastic training acts as a regularizer. The study also finds that the ability to fit correctly labeled data relative to randomly labeled data predicts generalization, and that ReLU activations increase the amount of data a network can fit, even though they were designed to avoid vanishing and exploding gradients. Additionally, reparameterization techniques such as subspace training and low-precision representations can substantially boost parameter efficiency, suggesting neural networks are often parameter-wasteful and that careful reparameterization can improve practical performance without sacrificing expressiveness.

Abstract

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
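The notion of "largest sample size a model can fit" can be made concrete with a small search routine. The sketch below is illustrative only and is not the authors' protocol: the function names `estimate_emc` and `make_linear_can_fit` are hypothetical, the search assumes that fitting is monotone in the sample size (if n samples can be fit, so can fewer), and a least-squares linear model on random labels stands in for a trained neural network.

```python
import numpy as np


def estimate_emc(can_fit, n_start=1, n_max=1_000_000):
    """Return the largest n in [n_start, n_max] with can_fit(n) True, or 0.

    Assumption (not from the paper): can_fit is monotone, i.e. if a model
    fits n samples perfectly it also fits any smaller sample.
    """
    if not can_fit(n_start):
        return 0
    lo = n_start   # largest size known to be fit
    hi = None      # smallest size known to fail
    # Phase 1: double the sample size until fitting fails or the cap is hit.
    while lo < n_max:
        probe = min(2 * lo, n_max)
        if can_fit(probe):
            lo = probe
        else:
            hi = probe
            break
    if hi is None:
        return lo  # everything up to n_max was fit
    # Phase 2: binary search between the last success and the first failure.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_fit(mid):
            lo = mid
        else:
            hi = mid
    return lo


def make_linear_can_fit(num_params, tol=1e-6):
    """Toy stand-in for 'train the model and check for a perfect fit'.

    A linear model with num_params weights is fit by least squares to
    n random inputs with random +/-1 labels; the fit counts as perfect
    when every residual is below tol.
    """
    def can_fit(n):
        rng = np.random.default_rng(n)  # deterministic per sample size
        X = rng.standard_normal((n, num_params))
        y = rng.choice([-1.0, 1.0], size=n)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return bool(np.max(np.abs(X @ w - y)) < tol)
    return can_fit


if __name__ == "__main__":
    # For this toy linear model the estimate is expected to equal the
    # parameter count, since n random points are interpolated only if n <= 16.
    print(estimate_emc(make_linear_can_fit(num_params=16)))
```

For a real network, `can_fit` would be replaced by a full training run followed by a check of training accuracy, which makes each probe expensive; the doubling-then-bisection schedule is one way to keep the number of runs logarithmic in the final estimate.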

Paper Structure (21 sections, 21 figures)

Figures (21)

  • Figure 1: Left: easier tasks tend to have higher EMC. EMC across datasets and data modalities. The tabular datasets (Forest, Income, CoverType), which are easier to learn, have higher EMC than the vision datasets. The dashed black line is the diagonal. ImageNet is the hardest dataset to learn. Right: the difference in EMC on the original and random labels predicts generalization. EMC improvement as a function of the parameter count for CIFAR-100.
  • Figure 2: CNNs fit more semantically labeled samples than they have parameters due to their superior image classification inductive bias, whereas MLPs cannot. EMC as a function of the number of parameters for semantic labels vs. random input and labels for MLPs (a) and CNNs (b). Experiments performed on ImageNet-20MS. Error bars represent one standard error over 5 trials.
  • Figure 3: The effect of the number of labels and optimizers on capacity. Average logarithm of EMC across different model sizes of CNNs on CIFAR-100 for original and random labels with varying numbers of classes (a) and for different optimizers (b). Error bars represent one standard error over 5 trials.
  • Figure 4: The effect of the scaling strategy and the architecture on the EMC. (a) Scaling laws for the EMC as a function of parameter count for CNNs. (b) Average logarithm of EMC across parameter counts for different architectures using original and random labels. Experiments performed on ImageNet-20MS. Error bars represent one standard error over 5 trials.
  • Figure 5: Scaling laws: EMC as a function of the number of parameters for ViTs on randomly labeled ImageNet-20MS.
  • ...and 16 more figures