Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks

Amit Peleg, Matthias Hein

TL;DR

The paper asks why heavily overparameterized neural networks generalize, and disentangles the implicit bias of SGD from the bias of the architecture in a low-sample, binary-classification setting. It compares SGD-trained networks with Guess-and-Check (G&C) networks, which are sampled at random and kept only if they reach zero training error. A Lipschitz-based normalization of the loss, tied to the geometric margin, makes the comparison possible across architectures and parameter scales. The main findings are that increasing width improves the generalization of SGD, which is attributed to SGD's implicit bias because G&C is largely insensitive to width, and that increasing depth harms generalization for both SGD and G&C, which is attributed to architectural bias. These results indicate when optimization dynamics, rather than architectural choices, drive generalization, which can inform the choice of architecture and training protocol. The approach relies on zero-training-error networks, controlled initializations, and Lipschitz normalization to compare generalization across models and on datasets such as MNIST and CIFAR-10.

Abstract

Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic gradient descent (SGD) and a possible simplicity bias arising from the neural network architecture. The goal of this paper is to disentangle the factors that influence generalization stemming from optimization and architectural choices by studying random and SGD-optimized networks that achieve zero training error. We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and this benefit is due to the bias of SGD and not due to an architectural bias. In contrast, for increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias. For more information, see https://bias-sgd-or-architecture.github.io .
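
The Guess-and-Check procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny ReLU MLP, the four-point toy dataset, the $\mathcal{U}[-1, 1]$ prior on all weights and biases, and the names `sample_mlp`, `forward`, and `guess_and_check` are assumptions made for illustration only.

```python
import numpy as np

def sample_mlp(rng, sizes, low=-1.0, high=1.0):
    """Draw every weight and bias i.i.d. from U[low, high] (assumed prior P(W))."""
    return [
        (rng.uniform(low, high, size=(m, n)), rng.uniform(low, high, size=n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """ReLU MLP with a single real-valued output per input row."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:  # no activation on the output layer
            h = np.maximum(h, 0.0)
    return h[:, 0]

def guess_and_check(rng, x, y, sizes, max_tries=100_000):
    """Sample networks until one classifies all training points correctly."""
    for tries in range(1, max_tries + 1):
        params = sample_mlp(rng, sizes)
        pred = np.where(forward(params, x) > 0, 1, -1)
        if np.array_equal(pred, y):  # zero training error: accept the guess
            return params, tries
    return None, max_tries  # no fitting network found within the budget

# Hypothetical toy data: two classes with labels in {-1, +1}.
rng = np.random.default_rng(0)
x = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

params, tries = guess_and_check(rng, x, y, sizes=[2, 8, 1])
print("accepted after", tries, "samples")
```

No gradient information is used at any point, so an accepted network reflects only the prior and the architecture. That is why comparing such networks with SGD-trained ones can separate the two sources of bias.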

Paper Structure

This paper contains 21 sections, 13 equations, 27 figures, 4 tables.

Figures (27)

  • Figure 1: Generalization of SGD (optimized) versus G&C (randomly sampled) as a function of the prior on the weights $\mathrm{P}(W)$: We "train" $2000$ LeNet models to $100\%$ training accuracy on $16$ training samples from classes 0 and 7 of MNIST. Test accuracies for G&C are similar across initializations, and the normalized loss (see Section \ref{sec:method}) is similar across the uniform distributions. Column 1: For the $\mathcal{U}[-1, 1]$ initialization, as used by \citet{chiang2022loss}, the normalized losses and test accuracies of SGD and G&C are similar, except that SGD converges towards more low-margin solutions. The claim in \citet{chiang2022loss} that the average test accuracy of G&C resembles that of SGD, conditional on the normalized loss bin (black dots), is an artifact of the suboptimal convergence of SGD caused by this initialization. Columns 2-4: For other initializations, SGD (first row) improves considerably in terms of both loss and accuracy. In contrast, G&C remains unaffected, as it is independent of the scale of the weights in each layer. Results for different numbers of samples and other classes from MNIST and CIFAR10 are in Appendix \ref{sec:init-appendix}.
  • Figure 2: Analysis of overparameterization when increasing the width. Test accuracy vs the weight normalized loss (\ref{eq:weightnorm}) of \citet{chiang2022loss} and our Lipschitz normalized loss (\ref{eq:lipschitznorm estimate}) for SGD and G&C, for classes 0 & 7 of MNIST and 16 training samples, across 2000 LeNet models. Row 4: Widening the networks enhances the geometric margin (lower normalized loss) and the average test accuracy for SGD, while for G&C (Row 2) the margin improves only slightly and the average test accuracy remains the same. This suggests that the improvement is mainly due to the bias of SGD and not due to an architectural bias (see Figure \ref{fig:different_widths}). Rows 1 and 3: \citet{chiang2022loss} compare networks conditional on the (weight) normalized loss bin (illustrated by black boxes), which led them to conclude that G&C improves with increasing width. With our Lipschitz normalized loss, one would arrive at the opposite conclusion, which shows how strongly the conclusion depends on the choice of normalization. Results for different numbers of samples and other classes from MNIST and CIFAR10 are in Appendix \ref{sec:width-appendix}.
  • Figure 3: Increasing width is a positive optimization bias. From top to bottom, the architectures are LeNet, MLP, and ResNet. Within each architecture, the first row corresponds to MNIST and the second to CIFAR10. Left: Mean test accuracy vs number of training samples across network widths. SGD improves for wider networks, while G&C behaves similarly for all widths. Thus, for increasing width, SGD has a bias towards better generalizing networks independent of an architectural bias. However, there is also no overfitting for G&C and thus no sign that overparameterization hurts. Right: We report the negative log probability of G&C to find a network fitting the training data (${\mathrm{P}_W(\textrm{Train Error}=0)}$). This number remains the same for different widths, indicating that the pool of "fitting networks" does not change with increasing width. More class pairs of MNIST and CIFAR10 are provided in Appendix \ref{sec:width-appendix}.
  • Figure 4: Increasing depth is a negative architectural bias. From top to bottom, the architectures are LeNet, MLP, and ResNet. Within each architecture, the first row corresponds to MNIST and the second to CIFAR10. Configuration "2c-1f" means two convolutional layers followed by a fully connected layer. Left: Mean test accuracy vs number of training samples across network depths. G&C always performs worse as depth increases, whereas SGD stagnates or gets worse. Thus, overparameterization in terms of depth results in overfitting instead of better generalization, unlike for the width. Since both G&C and SGD follow a similar trend, the decrease in performance with increased depth can be attributed to architectural bias. Right: Deeper networks have a lower probability for G&C to fit the training data, indicating that the network produces more complex functions. More class pairs of MNIST and CIFAR10 are provided in Appendix \ref{sec:depth-appendix}.
  • Figure 5: Qualitative analysis of overparameterization in depth. In contrast to increasing width, increasing depth decreases the geometric margin (higher normalized loss). This decrease holds both for G&C (top) and SGD (bottom). We show 2000 LeNet models for each depth for classes 0 and 7 using a training set of size 16. Results are more pronounced for harder class pairs (see Figure \ref{fig:different_depths_loss_mnist_appendix}).
  • ...and 22 more figures
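
The quantity reported in the right-hand panels of Figures 3 and 4, the negative log probability that a randomly sampled network fits the training data, can be estimated by simple Monte Carlo counting. The sketch below is only an illustration under stated assumptions: a small ReLU MLP, a $\mathcal{U}[-1, 1]$ prior on all weights and biases, a four-point toy dataset, the natural logarithm, and the hypothetical helper names `random_mlp_predict` and `neg_log_fit_probability`. It is not the authors' code and is not meant to reproduce their numbers.

```python
import math
import numpy as np

def random_mlp_predict(rng, x, sizes, low=-1.0, high=1.0):
    """Sample one ReLU MLP from U[low, high] and return its +/-1 predictions."""
    h = x
    n_layers = len(sizes) - 1
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.uniform(low, high, size=(m, n))
        b = rng.uniform(low, high, size=n)
        h = h @ w + b
        if i < n_layers - 1:  # no activation on the output layer
            h = np.maximum(h, 0.0)
    return np.where(h[:, 0] > 0, 1, -1)

def neg_log_fit_probability(rng, x, y, sizes, n_samples=2000):
    """Monte Carlo estimate of -log P_W(Train Error = 0)."""
    hits = sum(
        np.array_equal(random_mlp_predict(rng, x, sizes), y)
        for _ in range(n_samples)
    )
    if hits == 0:
        return math.inf  # no fitting network observed in this sample
    return -math.log(hits / n_samples)

# Hypothetical toy data: two classes with labels in {-1, +1}.
x = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

rng = np.random.default_rng(0)
for sizes in ([2, 8, 1], [2, 8, 8, 8, 1]):  # one shallow and one deeper net
    print(sizes, neg_log_fit_probability(rng, x, y, sizes))
```

A larger value means that fitting networks are rarer under the prior. The estimate becomes noisy, or infinite, once the true probability drops well below `1 / n_samples`, so harder settings need far more samples than this toy example uses.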