Stabilizing RNN Gradients through Pre-training

Luca Herranz-Celotti, Jean Rouat

TL;DR

The paper tackles gradient instability in deep multi-layer RNNs and introduces pre-training to local stability (LSC) as a general, architecture-agnostic stabilization method. By revealing an additive gradient-explosion source from counting gradient paths across time and depth, it motivates weighting time and depth contributions to achieve $\mathbb{E}\rho \in \{0.5,1\}$ and demonstrates that a target radius of $\rho_t=0.5$ often yields better stability for deep models. It extends classical initialization theories (Glorot, He, Orthogonal) as special cases of LSC, and provides empirical evidence across differentiable and neuromorphic/state-space models that pre-training to LSC improves training and final performance. The approach offers a practical, scalable step prior to pre-training on large datasets, reducing the need for architecture-specific stable initializations and enabling stable training of very deep or complex recurrent architectures.

Abstract

Numerous theories of learning propose to prevent the gradient from exponential growth with depth or time, to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architectures are too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of a half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks, for differentiable, neuromorphic and state-space models, to fulfill the LSC often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
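
The path-counting argument in the abstract can be illustrated with a minimal sketch. It is a scalar idealization and not the paper's implementation: every arrow in the time and depth grid is assumed to carry the same scalar weight `w` (standing in for the radius of a transition derivative), and the names `count_paths` and `summed_chain_weight` are hypothetical. Under that assumption, the sum over all chains equals the number of paths times `w` raised to the path length, so a weight of one grows like the binomial coefficient while a weight of a half stays at or below one.

```python
from math import comb


def count_paths(dt: int, dl: int) -> int:
    """Count gradient paths across dt time steps and dl depth steps.

    Each path moves one step back in time or one step down in depth,
    so the count obeys the Pascal recursion on a rectangular grid.
    """
    paths = [[1] * (dl + 1) for _ in range(dt + 1)]
    for i in range(1, dt + 1):
        for j in range(1, dl + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1]
    return paths[dt][dl]


def summed_chain_weight(dt: int, dl: int, w: float) -> float:
    """Sum of w ** (path length) over all paths (assumed scalar weights).

    Every path has exactly dt + dl arrows, so the sum factorizes.
    """
    return count_paths(dt, dl) * w ** (dt + dl)


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        # Time and depth grow together: weight 1 explodes, weight 0.5 does not.
        print(n, summed_chain_weight(n, n, 1.0), summed_chain_weight(n, n, 0.5))
```

The weight-0.5 sum cannot exceed one because the binomial coefficients in a row of Pascal's triangle add up to $2^{\Delta t + \Delta l}$, which is exactly cancelled by the factor $0.5^{\Delta t + \Delta l}$.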

Paper Structure (25 sections, 13 theorems, 49 equations, 7 figures, 2 tables)

Key Result

Theorem 1

Let the multi-layer RNN be as in eq. (eq:general_sys2). Setting the radii of every transition derivative $M_k$ to $\mathbb{E}\rho=1$ gives an upper bound to the parameter update variance that increases with time and depth as the binomial coefficient $\frac{1}{T}\binom{T + L +2}{T}$. Instead, setting the radii

Figures (7)

  • Figure 1: Stabilizing to $\rho=1$ results in additive explosion while $\rho=0.5$ does not. a) In a $d$-RNN, gradients need to traverse both the time and the depth dimension when updating the learnable parameters. A transition derivative $M_k$ represents only one arrow in the time and depth grid, and there are several multiplicative chains $j_i$ to be considered, since the parameter update uses $J^{T,L}_{t,l}$, the sum of all the multiplicative chains from $T,L$ down to $t,l$. b) However, the number of paths $j_i$ is described by the binomial coefficient $\binom{\Delta l + \Delta t}{\Delta t}$, and therefore increases exponentially when time and depth tend to infinity simultaneously, as in iii) and proven in App. \ref{app:counting}. In fact, exponential growth looks like a straight line in a semi-log plot, as in iii). Instead, the aforementioned binomial coefficient grows only polynomially when either time or depth is kept fixed, as in i) and ii). c) We confirm our theoretical analysis experimentally, on a toy network and on the LSC pre-trained GRU: $\rho=1$ reveals an explosion of additive origin (right panels), while $\rho=0.5$ is able to stabilize gradients through time (left panels). The upper panels show network output (blue), derivative (orange), and our derivative bounds (green), for the toy network that we define as the PascalRNN, $\boldsymbol{h}_{t,l}=\rho\boldsymbol{h}_{t-1,l}+\rho\boldsymbol{h}_{t-1,l-1}$, of depth 10 and Gaussian input of mean zero and standard deviation 2, and the lower panels show an LSC pre-trained GRU network of depth 7 with the SHD task as input. Both upper bounds to the derivative under $\rho\in\{0.5,1\}$ are part of the proof of Thm. \ref{thm:ulscboth}, and $c_1, c_2$, defined in that proof, are task- and network-dependent constants that do not depend on time or depth. Notice that the growth of the derivative and of the bound is backwards in time, since backpropagation accumulates gradients backwards in operations, from $T,L$ to $0,0$. This confirms that standard FFN theories ($\rho=1$) cannot be directly applied to $d$-RNNs, since they result in an unexpected additive exponential gradient explosion that is not accounted for.
  • Figure 2: Bounds stabilization through pre-training enhances FFN learning. We investigate the effect that pre-training to achieve our Local Stability Condition (LSC) has on learning, for 30-layer FFNs. Such depth is not necessary to solve MNIST, and our interest is rather in confirming that we are able to stabilize gradients in very deep networks. Specifically, we stabilize the upper bound and we compare our results to two well-known initialization strategies for FFN, Glorot and He. Interestingly, even when the theoretically justified He initialization is available, such as for ReLU FFNs, the LSC can match or outperform it, even if it was initialized as Glorot before pre-training. In general, no theoretically justified initialization strategies are available for new architectures or unconventional activations, such as sine and cosine. Therefore, stability pre-training becomes a convenient approach to enhance learning. Notably, stabilizing to LSC tends to outperform all other alternatives in most scenarios.
  • Figure 3: Pre-training to LSC $\rho_t=0.5$ has a stronger impact on deeper $d$-RNNs. We pre-train and train the $\sigma$-RNN, $ReLU$-RNN, GRU, LSTM networks, on the sl-MNIST, SHD and PTB tasks, for both $\rho_t\in\{0.5, 1\}$, and compare with the non-pre-trained case ($none$ in the plot), for 4 random seeds each. We compare $d$-RNN networks with depth $L\in\{2,5\}$, since $\rho_t=0.5$ is expected by our theoretical analysis to have a stronger stabilizing effect for deeper networks. We find indeed that more than 63% of the time $\rho_t=0.5$ gave better performance than $\rho_t=1$ for both depths. No pre-training matched the rate of $\rho_t=0.5$ for depth $L=2$, but $\rho_t=0.5$ outperformed no pre-training 70% of the time with deeper networks. With the non-differentiable networks ALIF$_+$ and ALIF$_\pm$, it was often impossible to converge to a target $\rho_t=0.5$, but we see that the deeper the network, the more favorable it was to pre-train to $\rho_t=1$.
  • Figure 4: Radius of 0.5 outperforms a radius of 1 on state-space models. We train the LRU state-space model on the PTB task and see that clipping (dashed) does not have a significant effect, while $\rho_t=1$ is outperformed by $\rho_t=0.5$. Interestingly, by default the LRU has an initialization close to $\rho_t=0.5$. We use $\overline{\rho}_t=0.5$ to denote that the time component of the gradient is given a stronger weight than the depth component.
  • Figure S1: Grid Search for best learning rate. We show in the plots perplexity against learning rate, where perplexity is the exponentiation of the cross-entropy loss, to have more homogeneous plots. Grid search for optimal learning rate was performed on LSTM and ALIF. Optimal LSTM learning rate was used on the differentiable architectures, and optimal ALIF learning rate was used on the non-differentiable architectures.
  • ...and 2 more figures
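
The PascalRNN recursion from the Figure 1 caption, $\boldsymbol{h}_{t,l}=\rho\boldsymbol{h}_{t-1,l}+\rho\boldsymbol{h}_{t-1,l-1}$, can be unrolled in a few lines to see where the name comes from. The sketch below is an illustration under stated assumptions, not a reproduction of the figure: the state is scalar per layer, the Gaussian input is replaced by a unit impulse at layer 0 and time 0, layers below 0 are taken to be zero, and the function name `pascal_impulse_response` is hypothetical. Because the recursion is linear, the impulse response equals the derivative of each state with respect to the initial one.

```python
import numpy as np
from math import comb


def pascal_impulse_response(rho: float, steps: int, depth: int) -> np.ndarray:
    """Unroll h[t, l] = rho * h[t-1, l] + rho * h[t-1, l-1] from a unit impulse.

    Returns an array of shape (steps + 1, depth + 1) whose entry [t, l]
    is the derivative of h[t, l] with respect to h[0, 0].
    """
    h = np.zeros(depth + 1)
    h[0] = 1.0  # assumption: unit impulse instead of the Gaussian input
    history = [h.copy()]
    for _ in range(steps):
        # h[t-1, l-1], with zeros assumed below layer 0
        shifted = np.concatenate(([0.0], h[:-1]))
        h = rho * h + rho * shifted
        history.append(h.copy())
    return np.stack(history)


if __name__ == "__main__":
    depth, steps = 10, 40
    one = pascal_impulse_response(1.0, steps, depth)
    half = pascal_impulse_response(0.5, steps, depth)
    # With rho = 1 the rows are rows of Pascal's triangle.
    print(one[steps, depth], comb(steps, depth))
    # With rho = 0.5 every entry is binom(t, l) / 2**t, hence at most 1.
    print(half.max())
```

In this idealized setting the entry at time $t$ and layer $l$ is $\rho^{t}\binom{t}{l}$, which makes explicit how the additive path count and the multiplicative factor $\rho^{t}$ compete.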

Theorems & Definitions (23)

  • Theorem 1: Local Stability Condition, with radii $\mathbb{E}\rho\in \{0.5, 1\}$
  • Corollary 1
  • Lemma 1: Expected Spectral Norm
  • Definition: Decaying Covariance
  • Theorem 2: Local Stability Condition, with matrix norms $\mathbb{E} a_k^q=1$
  • proof
  • Theorem 3: Local Stability Condition, with matrix norms $\mathbb{E} a_k=0.5$
  • proof
  • Theorem 3: Local Stability Condition, with radii $\mathbb{E}\rho\in \{0.5, 1\}$
  • proof
  • ...and 13 more