The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions
George Philipp, Dawn Song, Jaime G. Carbonell
TL;DR
The paper challenges the notion that normalization and modern activations fully solve exploding gradients. To do so it introduces the gradient scale coefficient (GSC), a measure for diagnosing gradient pathology, and the residual trick. It demonstrates that exploding gradients persist in many popular MLP architectures and limit the depth to which they can be effectively trained, while ResNet-style skip connections and an orthogonal or structured initialization can substantially mitigate gradient growth. The authors connect gradient explosion to deeper theoretical constructs such as entropy and the collapsing domain problem, and they reveal a tradeoff between avoiding explosion and avoiding pseudo-linearity. Practically, the work offers a rigorous framework (the GSC) together with actionable guidance on initialization and dilution strategies for training very deep networks.
Abstract
Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities "solve" the exploding gradient problem, we show that this is not the case in general and that in a range of popular MLP architectures, exploding gradients exist and that they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the *collapsing domain problem*, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks. We show this is a direct consequence of the Pythagorean equation. By noticing that *any neural network is a residual network*, we devise the *residual trick*, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.
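Two statements in the abstract can be checked numerically in a few lines. The sketch below is illustrative only and is not the paper's construction: the helper names (`dense_tanh`, `as_residual`, `sq_norm`) are hypothetical, the residual is taken relative to an identity skip connection, and independent Gaussian vectors stand in for the outputs of a skip branch and a residual branch, on the assumption that these are close to orthogonal in high dimension.

```python
import math
import random


def dense_tanh(x, W):
    """One plain layer tanh(W x); W is a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def as_residual(layer):
    """Return r with layer(x) = x + r(x), for a width-preserving layer."""
    def residual(x):
        return [fi - xi for fi, xi in zip(layer(x), x)]
    return residual


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)


rng = random.Random(0)

# 1) "Any neural network is a residual network": a plain layer equals
#    an identity skip connection plus a residual function.
n = 64
W = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]


def plain(v):
    return dense_tanh(v, W)


r = as_residual(plain)
recombined = [xi + ri for xi, ri in zip(x, r(x))]
max_gap = max(abs(a - b) for a, b in zip(recombined, plain(x)))

# 2) Pythagorean equation: for (nearly) orthogonal s and rho,
#    ||s + rho||^2 is close to ||s||^2 + ||rho||^2.
#    Assumption: independent Gaussian vectors model the two branches.
m = 10_000
s = [rng.gauss(0.0, 1.0) for _ in range(m)]
rho = [rng.gauss(0.0, 1.0) for _ in range(m)]
lhs = sq_norm([a + b for a, b in zip(s, rho)])
rhs = sq_norm(s) + sq_norm(rho)
rel_gap = abs(lhs - rhs) / rhs

print("max reconstruction gap:", max_gap)
print("relative Pythagorean gap:", rel_gap)
```

The first check holds up to floating-point rounding for any width-preserving layer. The second holds only approximately and only under the stated orthogonality assumption. Neither check measures the GSC or gradient growth itself.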
