The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions

George Philipp, Dawn Song, Jaime G. Carbonell

TL;DR

The paper challenges the notion that normalization and modern activations fully solve exploding gradients by introducing the gradient scale coefficient (GSC) and the residual trick. It demonstrates that exploding gradients can persist in many popular MLP architectures and shows how they limit effective depth, while ResNet-style skip connections and an orthogonal/structured initialization can substantially mitigate gradient growth. The authors connect gradient explosions to deeper theoretical constructs like entropy and domain collapsing, and they reveal tradeoffs between avoiding explosion and avoiding pseudo-linearity. Practically, the work provides a rigorous framework (GSC) for diagnosing gradient pathology and actionable guidance on initialization and dilution strategies to push training of very deep networks further.

Abstract

Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks. We show this is a direct consequence of the Pythagorean equation. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.
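
The abstract's appeal to the Pythagorean equation can be illustrated with a minimal sketch. It assumes, purely for illustration, that the skip branch and the residual branch of a block produce roughly uncorrelated high-dimensional vectors, so that their squared lengths add. The function names, the dimension, and the scales below are hypothetical choices for this sketch and are not taken from the paper.

```python
import math
import random


def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def skip_plus_residual(dim=2000, skip_scale=1.0, residual_scale=0.5, seed=0):
    """Draw an independent 'skip' vector s and 'residual' vector r, then add them.

    Independence of s and r is an assumption of this sketch. In high
    dimension it makes s and r nearly orthogonal, so that
    ||s + r||^2 is close to ||s||^2 + ||r||^2.
    """
    rng = random.Random(seed)
    s = [rng.gauss(0.0, skip_scale) for _ in range(dim)]
    r = [rng.gauss(0.0, residual_scale) for _ in range(dim)]
    out = [a + b for a, b in zip(s, r)]
    return norm(s), norm(r), norm(out)


if __name__ == "__main__":
    ns, nr, nout = skip_plus_residual()
    print("||s||^2 + ||r||^2 =", ns ** 2 + nr ** 2)
    print("||s + r||^2       =", nout ** 2)
    # Relative contribution of the residual branch to the squared output length.
    print("residual share    =", nr ** 2 / nout ** 2)
```

Under this assumption, a residual branch half the size of the skip branch contributes only about a fifth of the squared output length, which gives some intuition for why a block with a skip connection perturbs its input less than the same block without one.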

Paper Structure

This paper contains 94 sections, 14 theorems, 49 equations, 13 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Consider any $r > 1$ and any neural network $f$ which can be trained to some error level in a certain number of steps by some gradient-based algorithm. There exists a network $f'$ with the same nominal and compositional depth as $f$ that can also be trained to the same error level as $f$ and to make

Figures (13)

  • Figure 1: Key metrics for architectures in their randomly initialized state evaluated on Gaussian noise. The x axis in A shows depth in terms of the number of linear layers counted from the input. The x axis in B-F counts nonlinearity layers, starting from the input. Note: The curve for layer-ReLU is shadowed by tanh in A, by ReLU in E and F and by SELU among others in C.
  • Figure 2: Illustrations of networks of different architectures as functions of the parameter in a single linear layer. For each network architecture as indicated under (C-K) with 50 linear layers, three random weight configurations are chosen that differ only at a single linear layer as indicated. For each location on the sphere centered on the origin containing those three configurations, the input shown in A from the CIFAR10 dataset is propagated through the network with weights indicated by that location. The length of the 3-dimensional output of the prediction layer is then normalized. Each location on the sphere is colored according to this output as shown in B. Weight configurations where the input is assigned class 1/2/3 are shown in red/green/blue respectively. Discs B through K are azimuthal projections. See section \ref{detailssection} for details.
  • Figure 3: Key metrics for exploding architectures trained on CIFAR10. See main text for explanation.
  • Figure 4: Illustration of theorem \ref{bigGradTheorem}. See main text for details.
  • Figure 5: The phenomenon of pseudo-linearity in ReLU and tanh nonlinearities. The nonlinearity function is shown in blue. The post-activations obtained by applying the nonlinearity to 50 individual pre-activations drawn from a Gaussian with mean $\mu$ and standard deviation $\sigma$ are shown as red dots. The closest linear fit to the 50 post-activations is shown as a red line, and it approximates these post-activations very closely. A: ReLU, $\mu=-0.3$, $\sigma=0.2$. B: ReLU, $\mu=0.3$, $\sigma=0.2$. C: tanh, $\mu=0$, $\sigma=0.2$.
  • ...and 8 more figures
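
The setup described in the Figure 5 caption can be reproduced in a few lines. The following is a minimal sketch, not the authors' code: it draws 50 pre-activations from a Gaussian with the stated $\mu$ and $\sigma$, applies the nonlinearity, and fits a least-squares line to the post-activations. The helper names (`fit_line`, `pseudo_linearity`), the fixed seed, and the use of the root-mean-square residual as the measure of closeness are assumptions made for this sketch.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept.

    Returns (slope, intercept, rms_residual).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, math.sqrt(sq_err / n)


def pseudo_linearity(nonlinearity, mu, sigma, n=50, seed=0):
    """Sample n pre-activations from N(mu, sigma^2) and fit a line to the
    resulting post-activations. A small RMS residual means the nonlinearity
    acts almost linearly on this input distribution."""
    rng = random.Random(seed)
    pre = [rng.gauss(mu, sigma) for _ in range(n)]
    post = [nonlinearity(x) for x in pre]
    return fit_line(pre, post)


if __name__ == "__main__":
    # The three parameter settings listed in the Figure 5 caption.
    cases = [
        ("A: ReLU", relu, -0.3, 0.2),
        ("B: ReLU", relu, 0.3, 0.2),
        ("C: tanh", math.tanh, 0.0, 0.2),
    ]
    for name, fn, mu, sigma in cases:
        slope, intercept, rms = pseudo_linearity(fn, mu, sigma)
        print(f"{name}: slope={slope:.3f} intercept={intercept:.3f} rms={rms:.4f}")
```

Rerunning `pseudo_linearity(math.tanh, 0.0, 2.0)` with a much wider input distribution gives a visibly larger residual, since tanh then saturates on many of the samples. The RMS residual is an absolute measure, so comparisons across different values of $\sigma$ should be read with that in mind.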

Theorems & Definitions (14)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Proposition 7
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • ...and 4 more