Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

Lorenzo Livi

TL;DR

Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.

Abstract

We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these predictions: across several sequence tasks, gates produce lag-dependent effective learning rates and concentrate gradient flow into low-dimensional subspaces, matching or exceeding the anisotropic structure induced by Adam. Notably, gating and optimizer-driven adaptivity shape complementary aspects of credit assignment: gates align state-space transport with loss-relevant directions, while optimizers rescale parameter-space updates. Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.
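The time-scale effect described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a common leaky-integrator update, $h_t = (1-\alpha)\,h_{t-1} + \alpha \tanh(W h_{t-1} + U x_t)$, which may differ from the paper's exact parametrization. It uses the spectral norm of the Jacobian product across a lag $\ell$ as a rough proxy for lag-dependent sensitivity; this proxy is not the paper's definition of effective learning rate. The function names, dimensions, and the 0.5 scaling of $W$ are all hypothetical choices made for illustration.

```python
import numpy as np

def leaky_step(h, x, W, U, alpha):
    """One leaky-integrator update (assumed form):
    h_new = (1 - alpha) * h + alpha * tanh(W h + U x)."""
    return (1.0 - alpha) * h + alpha * np.tanh(W @ h + U @ x)

def leaky_jacobian(h, x, W, U, alpha):
    """Exact state Jacobian of leaky_step with respect to h:
    (1 - alpha) * I + alpha * diag(1 - tanh^2(a)) @ W, with a = W h + U x."""
    a = W @ h + U @ x
    d = 1.0 - np.tanh(a) ** 2
    return (1.0 - alpha) * np.eye(h.size) + alpha * d[:, None] * W

def lag_profile(alpha, T=30, n=8, seed=0):
    """Spectral norm of d h_T / d h_{T - lag} for lag = 0..T on a random input."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    W *= 0.5 / np.linalg.norm(W, 2)   # illustrative choice: spectral norm 0.5
    U = rng.standard_normal((n, 1))
    xs = rng.standard_normal((T, 1))

    # Forward pass, storing the Jacobian of each step at the state it was applied to.
    h = np.zeros(n)
    jacobians = []
    for t in range(T):
        jacobians.append(leaky_jacobian(h, xs[t], W, U, alpha))
        h = leaky_step(h, xs[t], W, U, alpha)

    # Accumulate J_{T-1} J_{T-2} ... J_{T-lag}, starting from the most recent step.
    product = np.eye(n)
    profile = [1.0]
    for J in reversed(jacobians):
        product = product @ J
        profile.append(np.linalg.norm(product, 2))
    return np.array(profile)

slow = lag_profile(alpha=0.1)   # small alpha: long state-space time-scale
fast = lag_profile(alpha=0.9)   # large alpha: short state-space time-scale
print("lag 10 sensitivity, alpha=0.1:", slow[10])
print("lag 10 sensitivity, alpha=0.9:", fast[10])
```

Under these assumptions, the profile for the smaller $\alpha$ decays more slowly with lag, so a fixed global step size acts on gradient contributions that are weighted differently across lags. This is the kind of lag dependence the abstract attributes to the gates.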

Paper Structure

This paper contains 52 sections, 79 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Scalar-gated RNN. Top: truncation error versus $\varepsilon$ (left) and normalized remainder $C_2(\varepsilon)$ (right). Bottom: per-step norms $\|A_j\|_2$ and $\|B_j\|_2$ (left) and distribution of ratios $r_j$ (right).
  • Figure 2: Multi-gated RNN. Top: truncation error versus $\varepsilon$ (left) and normalized remainder $C_2(\varepsilon)$ (right). Bottom: per-step norms $\|A_j\|_2$ and $\|B_j\|_2$ (left) and distribution of ratios $r_j$ (right).
  • Figure 3: Leaky RNN (constant $\alpha$): normalized effective LR profile at final checkpoint (left), slope $s(\ell)$ across iterations (middle), and full sensitivity heatmap (right).
  • Figure 4: Scalar-gated RNN: normalized effective LR profile at final checkpoint (left), slope $s(\ell)$ across iterations (middle), and full sensitivity heatmap (right).
  • Figure 5: Multi-gated RNN: normalized effective LR profile at final checkpoint (left), slope $s(\ell)$ across iterations (middle), and full sensitivity heatmap (right).
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 8.1: Fréchet differentiability (krantz2003implicit; higham2008functions)