Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks
Amit Peleg, Matthias Hein
TL;DR
The paper asks why heavily overparameterized neural networks generalize, and separates the implicit bias of SGD from the bias of the architecture in a low-sample, binary-classification setting. It compares SGD-trained networks with Guess-and-Check (G&C) networks, which are obtained by randomly sampling parameters until a network reaches zero training error, and it uses a Lipschitz-based normalization of the geometric margin so that networks can be compared across architectures and parameter scales. Increasing width improves the generalization of SGD-trained networks while G&C networks are largely insensitive to width, so the benefit is attributed to the implicit bias of SGD. Increasing depth harms generalization for both SGD-trained and G&C networks, so this effect is attributed to an architectural bias. These results indicate when optimization dynamics, rather than architectural choices, drive generalization, which can inform the design of architectures and training protocols. The approach relies on zero-training-error networks, controlled initializations, and Lipschitz normalization, with experiments on datasets such as MNIST and CIFAR-10.
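The Guess-and-Check idea can be illustrated with a short sketch: sample the parameters of a small ReLU network at random, keep the first sample that classifies every training point correctly, and never take a gradient step. The helper names (`init_mlp`, `forward`, `guess_and_check`), the uniform sampling range, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Sample weights and biases for a ReLU MLP.
    The uniform range 1/sqrt(fan_in) is an assumed prior, not the paper's."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / np.sqrt(fan_in)
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        b = rng.uniform(-bound, bound, size=fan_out)
        params.append((W, b))
    return params

def forward(params, X):
    """Return one logit per row of X; the sign of the logit is the predicted class."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    return (h @ W + b).ravel()

def guess_and_check(X, y, sizes, max_tries=100_000, seed=0):
    """Sample random networks until one has zero training error.
    Labels y are in {-1, +1}. Returns (params, tries), or (None, max_tries)
    if no sample fits the training set within the budget."""
    rng = np.random.default_rng(seed)
    for attempt in range(1, max_tries + 1):
        params = init_mlp(rng, sizes)
        preds = np.where(forward(params, X) > 0, 1, -1)
        if np.array_equal(preds, y):
            return params, attempt
    return None, max_tries

# Toy usage: four 2-D points, two per class (hypothetical data).
X_toy = np.array([[-2.0, -1.0], [-1.5, -2.0], [1.5, 2.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])
net, tries = guess_and_check(X_toy, y_toy, sizes=[2, 8, 1])
```

The expected number of samples grows quickly with the number of training points, which is one reason such a comparison is practical only in a low-sample regime.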
Abstract
Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic gradient descent (SGD) and a possible simplicity bias arising from the neural network architecture. The goal of this paper is to disentangle the factors that influence generalization stemming from optimization and architectural choices by studying random and SGD-optimized networks that achieve zero training error. We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and this benefit is due to the bias of SGD and not due to an architectural bias. In contrast, for increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias. For more information, see https://bias-sgd-or-architecture.github.io.
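Comparing random and SGD-optimized networks across widths and depths requires a quantity that does not change when the weights are merely rescaled. The sketch below shows one way to obtain such a quantity: divide each signed logit by an upper bound on the network's Lipschitz constant. It assumes a bias-free ReLU MLP and uses the product of layer spectral norms as the bound, which is a standard choice; the paper's exact normalization may differ, and the function names are hypothetical.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of layer spectral norms. This upper-bounds the L2 Lipschitz
    constant of an MLP whose activations are 1-Lipschitz (e.g. ReLU)."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def forward(weights, X):
    """Bias-free ReLU MLP (a simplifying assumption); one logit per row of X."""
    h = X
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)
    return (h @ weights[-1]).ravel()

def normalized_margins(weights, X, y):
    """Signed logit y * f(x) divided by the Lipschitz bound.
    Labels y are in {-1, +1}; a positive value means a correct prediction."""
    return y * forward(weights, X) / lipschitz_upper_bound(weights)
```

Because ReLU is positively homogeneous, multiplying any layer's weights by a positive constant scales the logits and the bound by the same factor, so the normalized margins are unchanged. This is what makes the quantity usable across parameter scales.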
