Revisiting Glorot Initialization for Long-Range Linear Recurrences

Noga Bar, Mariia Seleznova, Yotam Alexander, Gitta Kutyniok, Raja Giryes

TL;DR

The paper analyzes the stability of Glorot initialization for linear RNNs in the long-sequence regime and shows that, despite the bulk spectrum lying within the unit disk, the spectral edge typically lies above one, causing hidden-state explosion under double-scaling (infinite width and length). It then introduces a simple, dimension-aware rescaling of Glorot that shifts the spectral radius slightly below one using a calibrated parameter, providing both theoretical guarantees and empirical validation of improved stability. The authors develop a variance-based lower bound on hidden-state growth for finite width and length, show distinct behaviors in finite versus double-scaling regimes, and demonstrate that the rescaled initializer maintains trainability on long-range benchmarks. These results motivate a separate theoretical treatment of recurrent initialization under long sequences and offer a practical baseline for stable long-range sequence modeling.

Abstract

Proper initialization is critical for Recurrent Neural Networks (RNNs), particularly in long-range reasoning tasks, where repeated application of the same weight matrix can cause vanishing or exploding signals. A common baseline for linear recurrences is Glorot initialization, designed to ensure stable signal propagation--but derived under the infinite-width, fixed-length regime--an unrealistic setting for RNNs processing long sequences. In this work, we show that Glorot initialization is in fact unstable: small positive deviations in the spectral radius are amplified through time and cause the hidden state to explode. Our theoretical analysis demonstrates that sequences of length $t = O(\sqrt{n})$, where $n$ is the hidden width, are sufficient to induce instability. To address this, we propose a simple, dimension-aware rescaling of Glorot that shifts the spectral radius slightly below one, preventing rapid signal explosion or decay. These results suggest that standard initialization schemes may break down in the long-sequence regime, motivating a separate line of theory for stable recurrent initialization.
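The instability described in the abstract can be probed numerically. The following is a minimal NumPy sketch, not the authors' code: it samples square matrices with i.i.d. $\mathcal{N}(0, 1/n)$ entries (the Glorot variance for a square $n \times n$ layer), measures the spectral radius, and tracks $\|\mathbf{W}^k \mathbf{x}\|_2$ over repeated application of the same matrix. The helper names (`sample_glorot`, `spectral_radius`, `power_norms`), the width `n = 200`, the horizon of 300 steps and the seed are illustrative assumptions.

```python
import numpy as np


def sample_glorot(n, rng):
    """Real n x n matrix with i.i.d. N(0, 1/n) entries (Glorot variance, square layer)."""
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / n), size=(n, n))


def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))


def power_norms(W, x, steps):
    """Return [||x||, ||W x||, ..., ||W^steps x||] without forming W^k explicitly."""
    norms = [np.linalg.norm(x)]
    v = x.copy()
    for _ in range(steps):
        v = W @ v
        norms.append(np.linalg.norm(v))
    return np.array(norms)


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # seed chosen arbitrarily
    n, steps = 200, 300             # illustrative sizes, not taken from the paper

    # How often does the spectral radius land above one?
    radii = np.array([spectral_radius(sample_glorot(n, rng)) for _ in range(20)])
    print("fraction of samples with spectral radius > 1:", np.mean(radii > 1.0))

    # Norm of W^k x for one realization of W and one Gaussian input.
    W = sample_glorot(n, rng)
    x = rng.normal(size=n)
    norms = power_norms(W, x, steps)
    print("spectral radius:", spectral_radius(W))
    print("||W^k x|| at k = 0, 100, 200, 300:", norms[[0, 100, 200, 300]])
```

For large $k$ the growth rate of $\|\mathbf{W}^k \mathbf{x}\|_2$ is governed by the spectral radius, so even a small excess over one compounds over long sequences.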

Paper Structure

This paper contains 29 sections, 9 theorems, 40 equations, 4 figures, 2 tables.

Key Result

Theorem 3.2

Assume a matrix $\mathbf{W}\in\mathbb{F}^{n\times n}$ with $\mathbb{F}\in\{\mathbb{R},\mathbb{C}\}$ is sampled from a complex or real Glorot initialization. Then, the empirical distribution of the eigenvalues of $\mathbf{W}$, denoted $\mu_n(\lambda)$, converges almost surely to a uniform distribution ...
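The theorem statement is cut off here. In its standard form, the circular law says that the eigenvalues of an $n \times n$ matrix with i.i.d. zero-mean entries of variance $1/n$ fill the unit disk uniformly as $n$ grows, so a disk of radius $r \le 1$ should hold roughly a fraction $r^2$ of them. The sketch below checks this empirically under that assumed form. The function names, the width `n = 400` and the complex-entry convention (real and imaginary parts each with variance $1/(2n)$) are assumptions for illustration and may differ from the paper's definition of complex Glorot.

```python
import numpy as np


def radial_fraction(eigs, r):
    """Fraction of eigenvalues with magnitude at most r."""
    return float(np.mean(np.abs(eigs) <= r))


def glorot_eigs(n, rng, complex_entries=False):
    """Eigenvalues of an n x n matrix with i.i.d. zero-mean entries of variance 1/n."""
    if complex_entries:
        # Assumed convention: real and imaginary parts each have variance 1/(2n).
        scale = np.sqrt(1.0 / (2 * n))
        W = rng.normal(scale=scale, size=(n, n)) + 1j * rng.normal(scale=scale, size=(n, n))
    else:
        W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))
    return np.linalg.eigvals(W)


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed
    eigs = glorot_eigs(400, rng)    # illustrative width
    for r in (0.25, 0.5, 0.75, 1.0):
        # The uniform measure on the unit disk puts mass r**2 inside radius r.
        print(f"r={r:.2f}  empirical={radial_fraction(eigs, r):.3f}  uniform-disk={r**2:.3f}")
```

The bulk matching $r^2$ does not contradict the instability result: the circular law describes where most eigenvalues sit, while the largest eigenvalue magnitude is an edge statistic that can still exceed one at finite $n$.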

Figures (4)

  • Figure 1: Top vs. bottom: Glorot and rescaled initialization, respectively. Left: Empirical density of spectral radius for 100 independent samples of $\mathbf{W} \in \mathbb{R}^{500\times500}$, with entries drawn i.i.d. from $\mathcal{N}(0,1/n)$. Glorot often yields eigenvalues with magnitudes exceeding one. Middle: Norms of $\|\mathbf{W}^k \mathbf{x}\|_2$ as a function of $k$, with each curve corresponding to a different realization of $\mathbf{W}$. Inputs $\mathbf{x}$ are sampled i.i.d. from $\mathcal{N}(0,\mathbb{I}_{500})$. We report the mean and variance over $50$ random input samples. Glorot leads to exploding norms; the rescaled variant produces slowly decaying norms. Right: Norms of hidden states $\|\mathbf{h}_t\|_2$ in a recurrent layer with i.i.d. Gaussian inputs. The mean and variance are computed over $50$ input realizations. Glorot initialization results in unstable (exploding) hidden states, while the rescaled initialization maintains stability over time.
  • Figure 2: Top vs. bottom: Complex Glorot and rescaled initialization, respectively. Left: Empirical density of spectral radius for 100 independent samples of $\mathbf{W} \in \mathbb{C}^{500\times500}$, with entries drawn i.i.d. according to complex Glorot. Glorot often yields eigenvalues with magnitudes exceeding one. Middle: Norms of $\|\mathbf{W}^k \mathbf{x}\|_2$ as a function of $k$, with each curve corresponding to a different realization of $\mathbf{W}$. Inputs $\mathbf{x}$ are sampled i.i.d. from $\mathcal{N}(0,\mathbb{I}_{500})$. We report the mean and variance over $50$ random input samples. Glorot leads to exploding norms; the rescaled variant produces slowly decaying norms. Right: Norms of hidden states $\|\mathbf{h}_k\|_2$ in a recurrent layer with i.i.d. Gaussian inputs. The mean and variance are computed over $50$ input realizations. Glorot initialization results in unstable (exploding) hidden states, while the rescaled initialization maintains stability over time.
  • Figure 3: Histograms of the largest eigenvalue magnitude for the real and complex Glorot ensembles (i.e., Glorot initializations). The behavior is similar overall, with two notable differences: the mean largest eigenvalue magnitude is slightly smaller for real Glorot matrices than in the complex case, but the right tail of the real distribution is longer. While the typical size of the largest eigenvalue is similar in both cases, the longer tail results in the expected $k$-th absolute moment being larger in the real case for large $k$.
  • Figure 4: Norms of hidden states $\|\mathbf{h}_k\|_2$ in a recurrent layer with i.i.d. Gaussian inputs. The mean is computed over 100 realizations of $\mathbf{W}$ and 5 independent inputs. Both real and complex Glorot initializations lead to similar unstable (exploding) hidden states.
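The hidden-state curves in the right-hand panels of Figures 1 and 2 and in Figure 4 can be approximated with a short simulation. This is a hedged sketch rather than the paper's experimental code: it assumes the linear recurrence $\mathbf{h}_t = \mathbf{W}\mathbf{h}_{t-1} + \mathbf{x}_t$ with $\mathbf{h}_0 = \mathbf{0}$, and the constant `shrink` is a hypothetical stand-in for a rescaling, not the calibrated, dimension-aware factor the paper proposes. The function name, width and sequence length are likewise illustrative.

```python
import numpy as np


def hidden_state_norms(W, inputs):
    """Norms ||h_t||_2 for h_t = W h_{t-1} + x_t with h_0 = 0 (assumed recurrence)."""
    h = np.zeros(W.shape[0], dtype=np.result_type(W, inputs))
    norms = []
    for x_t in inputs:
        h = W @ h + x_t
        norms.append(np.linalg.norm(h))
    return np.array(norms)


if __name__ == "__main__":
    rng = np.random.default_rng(0)       # arbitrary seed
    n, T = 200, 500                      # illustrative width and sequence length
    W = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))  # entries ~ N(0, 1/n)
    inputs = rng.normal(size=(T, n))     # i.i.d. standard Gaussian inputs

    shrink = 0.95  # hypothetical factor; NOT the paper's calibrated rescaling
    for name, M in (("glorot", W), ("shrunk", shrink * W)):
        norms = hidden_state_norms(M, inputs)
        print(name, "||h_t|| at t = 1, 100, 500:", norms[[0, 99, 499]])
```

Multiplying $\mathbf{W}$ by a constant multiplies its spectral radius by the same constant, which is why a fixed shrink factor is enough to illustrate the qualitative contrast; choosing that factor as a function of $n$ is the part the paper addresses.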

Theorems & Definitions (13)

  • Remark 3.1
  • Theorem 3.2: Circular law
  • Theorem 4.1: Spectral radius, Theorem 1.1 of rider2014extremal and Theorem 1 of rider2003limit
  • Corollary 4.2
  • Theorem 6.1: Variance lower bound
  • Remark 6.2: Real case
  • Theorem : Variance lower bound
  • proof
  • Proposition 11.1: Real-eigenvalue density Edelman1994HowME
  • Proposition 11.2: Complex‐eigenvalue density 10.1006/jmva.1996.1653
  • ...and 3 more