Big Neural Networks Waste Capacity

Yann N. Dauphin, Yoshua Bengio

TL;DR

The paper investigates whether optimization limits, rather than capacity, constrain training of very large neural networks on ImageNet. By systematically varying the hidden-unit count in a one-hidden-layer MLP and analyzing training error and the ROI of added capacity, the authors show that while capacity reduces training error, the benefit rapidly diminishes, and large models may not outperform simple baselines after fixed training time. They argue that first-order gradient descent struggles in regimes with many interacting units, pointing to Hessian conditioning as a likely culprit. The work motivates optimization- and parametrization-centered approaches, such as sparsity/orthogonality penalties and stochastic second-order or natural-gradient methods, as well as deeper-network investigations, to unlock the benefits of scale on large datasets.

Abstract

This article exposes the failure of some big neural networks to leverage added capacity to reduce underfitting. Past research suggests diminishing returns when increasing the size of neural networks. Our experiments on ImageNet LSVRC-2010 show that this may be because there are highly diminishing returns for capacity in terms of training error, leading to underfitting. This suggests that the optimization method, first-order gradient descent, fails in this regime. Directly attacking this problem, either through the optimization method or the choice of parametrization, may improve the generalization error on large datasets, for which a large capacity is required.

Paper Structure

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: Training error with respect to the capacity of a 1-hidden-layer sigmoidal neural network. This curve seems to suggest we are correctly leveraging added capacity.
  • Figure 2: Return on investment on the addition of hidden units for a 1-hidden-layer sigmoidal neural network. The vertical axis is the number of training errors removed per additional hidden unit, after 300 epochs. We see here that it becomes harder and harder to use added capacity.
  • Figure 3: Training error with respect to the number of epochs of gradient descent. Each line is a 1-hidden-layer sigmoidal neural network with a different number of hidden units.
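
The return-on-investment quantity plotted in Figure 2 (training errors removed per additional hidden unit) can be illustrated with a minimal sketch. It assumes the quantity is computed between consecutive network sizes, which the caption does not state explicitly. The function name `return_on_investment` is hypothetical, and the numbers are placeholders chosen for illustration, not results from the paper.

```python
def return_on_investment(hidden_units, training_errors):
    """Training errors removed per additional hidden unit.

    Assumption: the ratio is taken between consecutive network sizes,
    with training errors counted after the same number of epochs.
    """
    if len(hidden_units) != len(training_errors):
        raise ValueError("hidden_units and training_errors must have equal length")

    roi = []
    for i in range(1, len(hidden_units)):
        added_units = hidden_units[i] - hidden_units[i - 1]
        if added_units <= 0:
            raise ValueError("hidden_units must be strictly increasing")
        errors_removed = training_errors[i - 1] - training_errors[i]
        roi.append(errors_removed / added_units)
    return roi


# Placeholder values for illustration only (NOT taken from the paper).
example_sizes = [1000, 2000, 4000]   # hidden units per network
example_errors = [900, 700, 600]     # training errors after a fixed number of epochs

example_roi = return_on_investment(example_sizes, example_errors)
print(example_roi)
```

With these placeholder inputs the ratio falls from one size step to the next, which is the shape of diminishing returns that Figure 2 describes: each added hidden unit removes fewer training errors than the previous ones did.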