
Temporal Credit Is Free

Aur Shalev Merin

Abstract

Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: $\beta_2$ is required when gradients must pass through a nonlinear state update with no output bypass, and is unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to $n = 1024$ with 1000$\times$ less memory.
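
To make the abstract's claim concrete, here is a minimal sketch of the kind of update it describes: only immediate (single-step) derivatives, no stored sensitivities (i.e. trace decay $\lambda = 0$), with per-parameter second-moment ($\beta_2$) normalization. The vanilla-RNN setup, names, sizes, and hyperparameters below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch: online learning with immediate derivatives (lambda = 0, no Jacobian
# propagation) plus RMSprop-style second-moment (beta_2) normalization.
# Architecture, sizes, and hyperparameters are assumptions for illustration.

rng = np.random.default_rng(0)
n, m = 64, 8                                  # hidden size, input size (assumed)
W_rec = rng.normal(0, 1 / np.sqrt(n), (n, n))
W_in  = rng.normal(0, 1 / np.sqrt(m), (n, m))
W_out = rng.normal(0, 1 / np.sqrt(n), (1, n))

lr, beta2, eps = 1e-3, 0.999, 1e-8
v = {"rec": np.zeros((n, n)), "in": np.zeros((n, m)), "out": np.zeros((1, n))}
h = np.zeros(n)

def step(x, y_target):
    """One online step: forward pass, immediate-derivative gradient, normalized update."""
    global h, W_rec, W_in, W_out
    h_prev = h
    a = W_rec @ h_prev + W_in @ x
    h = np.tanh(a)
    y = W_out @ h
    err = y - y_target                        # gradient of 0.5 * squared error w.r.t. y

    # Immediate derivatives only: credit flows through the current step's
    # forward pass; no stored sensitivity of h_prev w.r.t. the weights.
    dL_dh = W_out.T @ err
    dL_da = dL_dh * (1 - h**2)                # through tanh at this step only
    grads = {
        "out": np.outer(err, h),
        "rec": np.outer(dL_da, h_prev),
        "in":  np.outer(dL_da, x),
    }

    # Second-moment normalization equalizes gradient scales across parameter
    # groups (e.g. the ~100x recurrent-vs-output mismatch noted in Figure 2b).
    for k, g in grads.items():
        v[k] = beta2 * v[k] + (1 - beta2) * g**2
    W_out -= lr * grads["out"] / (np.sqrt(v["out"]) + eps)
    W_rec -= lr * grads["rec"] / (np.sqrt(v["rec"]) + eps)
    W_in  -= lr * grads["in"]  / (np.sqrt(v["in"])  + eps)
    return float(err @ err)

# Usage: feed a stream of (input, target) pairs one step at a time.
loss = step(rng.normal(size=m), 0.3)
```

The point of the sketch is what it leaves out: there is no $n \times |\theta|$ sensitivity tensor and no truncated backward pass, only per-parameter statistics of the same size as the weights themselves.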

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Recovery vs. trace decay on a sine frequency-shift task (5 seeds, mean $\pm$ 1 std). The standard default ($\lambda = 0.95$) produces 0% recovery at both scales. Values below 0.5 form a safe plateau. The optimal decay decreases with network size: at $n=256$, all values $\leq 0.5$ are statistically indistinguishable.
  • Figure 2: (a) Optimizer isolation with immediate derivatives ($\lambda = 0$). Only methods with second-moment normalization ($\beta_2$) adapt. Momentum ($\beta_1$) adds nothing. (b) Gradient norms by parameter group in a trained vanilla RNN ($n=64$), showing the 100$\times$ scale mismatch between recurrent and output weights.
  • Figure 3: Adam/$\beta_2$ (circles) vs. SGD (squares) recovery across ten architectures. Red background: no output bypass ($\beta_2$ required). Green background: has output bypass (SGD suffices or is better). The gap between dots is the $\beta_2$ requirement.
  • Figure 4: (a) Memory scaling: immediate derivatives ($d{=}0$) vs. full RTRL. At $n=1024$ the gap is 1000$\times$ (12.6 MB vs. 12.9 GB). (b) Cross-session BCI decoding (7-month electrode drift, 5 seeds). $d{=}0$ + RMSprop (106%) exceeds all Jacobian-based methods.
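
To see where Figure 4a's 1000$\times$ gap comes from: RTRL stores one sensitivity entry per hidden unit per parameter, while immediate derivatives store only parameter-sized quantities, so the ratio is roughly the hidden-state size $n$. The back-of-the-envelope sketch below reproduces the caption's numbers; the float32 assumption and the parameter count (backed out from the 12.6 MB figure) are ours, not taken from the paper.

```python
# Rough memory comparison behind Figure 4a (assumptions: float32 storage,
# parameter-sized state P inferred from the caption's 12.6 MB figure).
BYTES = 4                        # float32 (assumption)
n = 1024                         # hidden-state size from Figure 4a
P = 12_600_000 // BYTES          # implied number of stored per-parameter values

imm_bytes  = P * BYTES           # immediate derivatives: O(P) floats  -> ~12.6 MB
rtrl_bytes = n * P * BYTES       # full RTRL sensitivity: n x P floats -> ~12.9 GB
print(f"immediate: {imm_bytes/1e6:.1f} MB, RTRL: {rtrl_bytes/1e9:.1f} GB, "
      f"ratio ~{rtrl_bytes/imm_bytes:.0f}x")
```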