Massive Redundancy in Gradient Transport Enables Sparse Online Learning

Aur Shalev Merin

Abstract

Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability across five seeds, and the absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6% of paths, recovery 84% to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without a continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.
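
To make the idea of propagating through k of n paths concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a plain tanh RNN, h_t = tanh(W h_{t-1} + U x_t), and assumes that "k paths" means each neuron keeps k randomly chosen incoming recurrent connections when the sensitivity tensor is carried forward. The names `make_mask`, `rtrl_step`, and `run` are hypothetical. Under these assumptions k=0 leaves only the immediate term (an eligibility-trace-like update) and k=n reproduces exact RTRL.

```python
import numpy as np

def make_mask(n, k, rng):
    """Random 0/1 mask with exactly k ones per row (k kept paths per neuron)."""
    mask = np.zeros((n, n))
    for i in range(n):
        mask[i, rng.choice(n, size=k, replace=False)] = 1.0
    return mask

def rtrl_step(W, U, h_prev, x, P_prev, mask):
    """One tanh-RNN step plus a masked RTRL sensitivity update.

    P[i, j, l] tracks d h[i] / d W[j, l] (exact when mask is all ones).
    """
    n = W.shape[0]
    h = np.tanh(W @ h_prev + U @ x)
    d = 1.0 - h ** 2  # tanh derivative at the new state
    # Transport term: carry old sensitivities through the kept paths only.
    transport = np.einsum("im,mjl->ijl", W * mask, P_prev)
    # Immediate term: d(W h_prev)[i] / d W[j, l] = delta_ij * h_prev[l].
    immediate = np.zeros((n, n, n))
    immediate[np.arange(n), np.arange(n), :] = h_prev
    P = d[:, None, None] * (transport + immediate)
    return h, P

# Toy setup (sizes and scales are arbitrary illustration values).
rng = np.random.default_rng(0)
n, m, T = 6, 3, 5
W = rng.normal(scale=0.3, size=(n, n))
U = rng.normal(scale=0.3, size=(n, m))
xs = rng.normal(size=(T, m))

def run(W, mask):
    """Roll the RNN over xs, returning the final state and sensitivities."""
    h, P = np.zeros(n), np.zeros((n, n, n))
    for x in xs:
        h, P = rtrl_step(W, U, h, x, P, mask)
    return h, P

h_full, P_full = run(W, np.ones((n, n)))           # k = n: exact RTRL
h_sparse, P_sparse = run(W, make_mask(n, 2, rng))  # k = 2: sparse transport
```

The forward pass is identical in both runs; only the transport of `P` is masked. This sketch still performs a dense contraction, so it illustrates the approximation rather than the savings: an implementation that touches only the k kept entries per row would need roughly a k/n fraction of the transport work, which is 4/64 = 6.25% at n=64 and 4/256 = 1.5625% at n=256.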

Paper Structure (49 sections, 11 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: The Jacobian is massively redundant. (a) Post-shift MSE drops by two orders of magnitude from $k{=}0$ (eligibility traces) to $k{=}4$ (6% of paths), then remains flat through $k{=}64$ (full RTRL). (b) Log-scale gap recovery shows a step function: any $k \geq 4$ recovers $82$--$87\%$ of full RTRL. Error bars: $\pm 1$ s.d. across 5 seeds.
  • Figure 2: Sparse RTRL is more stable than full RTRL on chaotic dynamics. Per-seed loss curves on the Lorenz attractor ($n{=}64$, 5 seeds). Full RTRL (red) degrades on 2 of 5 seeds; $k{=}4$ (blue) is stable on all seeds. Gray: eligibility traces ($k{=}0$). Vertical dotted line: parameter shift at $t{=}2000$.
  • Figure 3: Gradient transport redundancy across architectures and real neural data. (a) Recovery threshold comparison: RNNs require ${\sim}6\%$ of neurons, transformers ${\sim}33$--$50\%$ of heads, reflecting isotropy vs. specialization. (b) BCI cross-session adaptation: $k{=}4$ sparse RTRL adapts to 7-month electrode drift while the frozen decoder degrades and full RTRL is unstable on 2/5 seeds.
  • Figure 4: The Jacobian is full-rank and near-isotropic. (a) Singular value spectrum of the Jacobian tensor (reshaped to $n \times n^2$) at the shift point ($n{=}64$, sine task), showing nearly uniform singular values (condition number ${\sim}4$). Individual seeds shown. (b) Condition number remains low (2--5) throughout training across all 5 seeds, confirming isotropy is not an initialization artifact. Red dashed line: condition number $= 10$, above which random subsampling would degrade.
  • Figure 5: Sparse gradients are directionally aligned with full RTRL gradients. (a) Cosine similarity between $k$-sparse and full RTRL gradients over time (5-seed mean $\pm$ s.d.). Pre-shift cosine is ${\sim}0.92$; post-shift dip to ${\sim}0.82$ at $k{=}4$ reflects the harder adaptation regime but never collapses. (b) Post-shift cosine increases monotonically with $k$, consistent with the flat recovery curve.
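
The two diagnostics named in Figures 4 and 5 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's analysis code: the sensitivity tensor `P` is assumed to have shape (n, n, n) with `P[i, j, l]` holding the derivative of hidden unit i with respect to recurrent weight (j, l), and the function names are hypothetical.

```python
import numpy as np

def condition_number(P):
    """Largest over smallest singular value of P reshaped to (n, n*n)."""
    n = P.shape[0]
    s = np.linalg.svd(P.reshape(n, -1), compute_uv=False)  # sorted descending
    return s[0] / s[-1]

def rtrl_gradient(err, P):
    """Weight gradient from an error vector err[i] = dL/dh[i]."""
    return np.einsum("i,ijl->jl", err, P)

def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two gradients, flattened to vectors."""
    a, b = np.ravel(g_a), np.ravel(g_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A perfectly isotropic toy tensor: its (n, n*n) reshape has orthonormal rows,
# so every singular value equals 1.
rng = np.random.default_rng(0)
n = 4
Q, _ = np.linalg.qr(rng.normal(size=(n * n, n)))  # Q has orthonormal columns
P_iso = Q.T.reshape(n, n, n)
kappa = condition_number(P_iso)
```

A condition number near 1 means all directions of the reshaped Jacobian carry similar weight, which is the property the Figure 4 caption describes as near-isotropy. Feeding the gradients of a sparse run and a full run through `cosine_similarity` gives the kind of alignment measure plotted in Figure 5.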