Stabilizing RNN Gradients through Pre-training
Luca Herranz-Celotti, Jean Rouat
TL;DR
The paper tackles gradient instability in deep multi-layer RNNs and introduces pre-training to satisfy the Local Stability Condition (LSC) as a general, architecture-agnostic stabilization method. By revealing an additive source of gradient explosion that arises from counting gradient paths across time and depth, it motivates weighting the time and depth contributions to achieve $\mathbb{E}\rho \in \{0.5,1\}$, and demonstrates that a target radius of $\rho_t=0.5$ often yields better stability for deep models. It recovers the classical initialization schemes (Glorot, He, Orthogonal) as special cases of the LSC, and provides empirical evidence across differentiable, neuromorphic, and state-space models that pre-training to the LSC improves training and final performance. The approach offers a practical, scalable step prior to pre-training on large datasets, reducing the need for architecture-specific stable initializations and enabling stable training of very deep or complex recurrent architectures.
Abstract
Numerous theories of learning propose to prevent the gradient from growing exponentially with depth or time, in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC, for differentiable, neuromorphic, and state-space models, often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
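The path-counting argument in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes every step of a gradient path, whether one step in time or one step in depth, contributes the same scalar factor `weight`, standing in for the radius of the corresponding transition derivative. The function name `weighted_path_sum` and the grid sizes are hypothetical choices made for illustration.

```python
from math import comb


def weighted_path_sum(steps_time, steps_depth, weight=1.0):
    """Sum the weights of all monotone paths across a time x depth grid.

    Toy assumption: each edge (one step in time or one step in depth)
    multiplies the gradient by the same scalar `weight`.
    """
    grid = [[0.0] * (steps_depth + 1) for _ in range(steps_time + 1)]
    grid[0][0] = 1.0  # the starting node carries a unit gradient
    for t in range(steps_time + 1):
        for l in range(steps_depth + 1):
            if t == 0 and l == 0:
                continue
            from_time = grid[t - 1][l] if t > 0 else 0.0
            from_depth = grid[t][l - 1] if l > 0 else 0.0
            # Contributions arriving through time and through depth add up.
            grid[t][l] = weight * (from_time + from_depth)
    return grid[steps_time][steps_depth]


if __name__ == "__main__":
    for n in (5, 10, 20):
        unit = weighted_path_sum(n, n, weight=1.0)
        half = weighted_path_sum(n, n, weight=0.5)
        # With weight 1 the sum equals the number of paths, comb(2n, n).
        print(n, unit, comb(2 * n, n), half)
```

Under this simplification, a weight of one makes the sum equal to the number of paths, a binomial coefficient that grows exponentially along the diagonal of the grid even though no single path explodes. With a weight of one half, the same sum is the binomial coefficient divided by $2^{T+L}$, which never exceeds one.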
