Massive Redundancy in Gradient Transport Enables Sparse Online Learning

Aur Shalev Merin

Abstract

Real-time recurrent learning (RTRL) computes exact online gradients by propagating a Jacobian tensor forward through recurrent dynamics, but at O(n^4) cost per step. Prior work has sought structured approximations (rank-1 compression, graph-based sparsity, Kronecker factorization). We show that, in the continuous error signal regime, the recurrent Jacobian is massively redundant: propagating through a random 6% of paths (k=4 of n=64) recovers 84 +/- 6% of full RTRL's adaptation ability across five seeds. The absolute count k=4 remains effective from n=64 to n=256 (6% to 1.6% of paths, recovery 84% to 78%), meaning sparse RTRL becomes relatively cheaper as networks grow. In RNNs, the recovery is selection-invariant (even adversarial path selection works) and exhibits a step-function transition from zero to any nonzero propagation. Spectral analysis reveals the mechanism: the Jacobian is full-rank but near-isotropic (condition numbers 2.6-6.5), so any random subset provides a directionally representative gradient estimate. On chaotic dynamics (Lorenz attractor), sparse propagation is more numerically stable than full RTRL (CV 13% vs. 88%), as subsampling avoids amplifying pathological spectral modes. The redundancy extends to LSTMs (k=4 matches full RTRL) and to transformers via sparse gradient transport (50% head sparsity outperforms the dense reference; 33% is borderline), with higher thresholds reflecting head specialization rather than isotropy. On real primate neural data, sparse RTRL (k=4) adapts online to cross-session electrode drift (80 +/- 11% recovery, 5 seeds), where sparse propagation is again more stable than full RTRL. Without a continuous error signal, Jacobian propagation accumulates numerical drift and degrades all RTRL variants, a scope condition for all forward-mode methods. Results hold with SGD (92 +/- 1% recovery), suggesting independence from optimizer choice.
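
To make the idea of propagating through only k of n paths concrete, the following is a minimal NumPy sketch for a vanilla tanh RNN. It is an illustration under stated assumptions, not the paper's implementation: the names `make_mask` and `rtrl_step`, the per-neuron random mask, and the choice to mask the recurrent weights only inside the sensitivity update are all hypothetical, since this excerpt does not specify the subset construction.

```python
import numpy as np

def make_mask(n, k, rng):
    """Random 0/1 mask keeping exactly k incoming paths per neuron (assumed scheme)."""
    mask = np.zeros((n, n))
    for i in range(n):
        mask[i, rng.permutation(n)[:k]] = 1.0
    return mask

def rtrl_step(W, U, b, h_prev, x, P_prev, mask):
    """One step of h = tanh(W h_prev + U x + b) with a masked sensitivity update.

    P[i, a, c] tracks d h[i] / d W[a, c]. The forward pass uses the full W;
    only the transport of old sensitivities is restricted by the mask.
    """
    n = h_prev.shape[0]
    h = np.tanh(W @ h_prev + U @ x + b)
    d = 1.0 - h ** 2                      # derivative of tanh at the pre-activation
    # Transport term: only masked entries of W carry P_prev forward.
    transported = np.einsum("ij,jac->iac", W * mask, P_prev)
    # Immediate term: d pre[i] / d W[a, c] = delta(i, a) * h_prev[c].
    immediate = np.zeros((n, n, n))
    idx = np.arange(n)
    immediate[idx, idx, :] = h_prev
    P = d[:, None, None] * (transported + immediate)
    return h, P

# Usage: k = n reproduces full RTRL, k = 0 keeps only the immediate term.
rng = np.random.default_rng(0)
n, m, k = 8, 3, 2
W = rng.normal(scale=0.3, size=(n, n))
U = rng.normal(scale=0.3, size=(n, m))
b = np.zeros(n)
mask = make_mask(n, k, rng)
h, P = np.zeros(n), np.zeros((n, n, n))
for x in rng.normal(size=(10, m)):
    h, P = rtrl_step(W, U, b, h, x, P, mask)
```

With an all-ones mask the update reduces to the exact RTRL recurrence for this network. The dense `einsum` here does not realize any cost saving; an implementation that exploits the mask would contract over only the k kept indices per neuron.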

Paper Structure (49 sections, 11 equations, 5 figures, 14 tables)

Introduction
Background
Real-Time Recurrent Learning
Eligibility Traces
Prior RTRL Approximations
Method
Sparse Jacobian Propagation
Implementation.
Subset construction.
Cost Analysis
Evaluation Protocol: Online Adaptation
Gap recovery metric.
Experiments
Metric.
Implementation validation.
...and 34 more sections

Figures (5)

Figure 1: The Jacobian is massively redundant. (a) Post-shift MSE drops by two orders of magnitude from $k{=}0$ (eligibility traces) to $k{=}4$ (6% of paths), then remains flat through $k{=}64$ (full RTRL). (b) Log-scale gap recovery shows a step function: any $k \geq 4$ recovers $82$--$87\%$ of full RTRL. Error bars: $\pm 1$ s.d. across 5 seeds.
Figure 2: Sparse RTRL is more stable than full RTRL on chaotic dynamics. Per-seed loss curves on the Lorenz attractor ($n{=}64$, 5 seeds). Full RTRL (red) degrades on 2 of 5 seeds; $k{=}4$ (blue) is stable on all seeds. Gray: eligibility traces ($k{=}0$). Vertical dotted line: parameter shift at $t{=}2000$.
Figure 3: Gradient transport redundancy across architectures and real neural data. (a) Recovery threshold comparison: RNNs require ${\sim}6\%$ of neurons, transformers ${\sim}33$--$50\%$ of heads, reflecting isotropy vs. specialization. (b) BCI cross-session adaptation: $k{=}4$ sparse RTRL adapts to 7-month electrode drift while the frozen decoder degrades and full RTRL is unstable on 2/5 seeds.
Figure 4: The Jacobian is full-rank and near-isotropic. (a) Singular value spectrum of the Jacobian tensor (reshaped to $n \times n^2$) at the shift point ($n{=}64$, sine task), showing nearly uniform singular values (condition number ${\sim}4$). Individual seeds shown. (b) Condition number remains low (2--5) throughout training across all 5 seeds, confirming isotropy is not an initialization artifact. Red dashed line: condition $= 10$, above which random subsampling would degrade.
Figure 5: Sparse gradients are directionally aligned with full RTRL gradients. (a) Cosine similarity between $k$-sparse and full RTRL gradients over time (5-seed mean $\pm$ s.d.). Pre-shift cosine is ${\sim}0.92$; post-shift dip to ${\sim}0.82$ at $k{=}4$ reflects the harder adaptation regime but never collapses. (b) Post-shift cosine increases monotonically with $k$, consistent with the flat recovery curve.
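
The two diagnostics described in the captions of Figures 4 and 5 can be sketched in a few lines of NumPy. This is a hedged illustration: the function names are hypothetical, and the only details taken from the captions are that the Jacobian tensor is reshaped to $n \times n^2$ before its singular values are computed and that sparse and full gradients are compared by cosine similarity.

```python
import numpy as np

def jacobian_condition_number(P):
    """Ratio of largest to smallest singular value of P reshaped to (n, n*n)."""
    n = P.shape[0]
    s = np.linalg.svd(P.reshape(n, -1), compute_uv=False)  # sorted descending
    return s[0] / s[-1]

def cosine_similarity(g_sparse, g_full):
    """Cosine of the angle between two gradients, flattened to vectors."""
    a, b = np.ravel(g_sparse), np.ravel(g_full)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage on a toy tensor whose reshaped form has singular values 3 and 1.5.
M = np.zeros((2, 4))
M[0, 0], M[1, 1] = 3.0, 1.5
print(jacobian_condition_number(M.reshape(2, 2, 2)))
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```

A condition number near 1 means all directions of the reshaped Jacobian carry similar weight, which is the sense of "near-isotropic" used in the Figure 4 caption.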
