Efficient Sparse Selective-Update RNNs for Long-Range Sequence Modeling

Bojian Yin; Shurong Wang; Haoyu Tan; Sander Bohte; Federico Corradi; Guoqi Li

TL;DR

This work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling, by allowing each neuron to learn its own update timescale and resolving the mismatch between how long a sequence is and how much information it actually contains.

Abstract

Real-world sequential signals, such as audio or video, contain critical information that is often embedded within long periods of silence or noise. While recurrent neural networks (RNNs) are designed to process such data efficiently, they often suffer from "memory decay" due to a rigid update schedule: they typically update their internal state at every time step, even when the input is static. This constant activity forces the model to overwrite its own memory and makes it hard for the learning signal to reach back to distant past events. Here we show that this limitation can be overcome using Selective-Update RNNs (suRNNs), a non-linear architecture that learns to preserve its memory when the input is redundant. By using a neuron-level binary switch that only opens for informative events, suRNNs decouple the recurrent updates from the raw sequence length. This mechanism allows the model to maintain an exact, unchanged memory of the past during low-information intervals, creating a direct path for gradients to flow across time. Our experiments on the Long Range Arena, WikiText, and synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage. By allowing each neuron to learn its own update timescale, our approach resolves the mismatch between how long a sequence is and how much information it actually contains. In providing a principled approach to managing temporal information density, this work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling.
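The neuron-level binary switch described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `selective_update_step`, the weight names `W_in`, `W_rec`, `b`, and the choice of `tanh` as the non-linearity are assumptions made for illustration, and the gate `g_t` is simply passed in rather than learned, since the abstract does not specify how it is computed.

```python
import numpy as np

def selective_update_step(h_prev, x_t, g_t, W_in, W_rec, b):
    """One selective-update step (illustrative sketch, not the paper's code).

    Neurons with g_t == 1 take a standard non-linear RNN update;
    neurons with g_t == 0 copy their previous state unchanged.
    """
    # Candidate state from a vanilla RNN transition (tanh is an assumption).
    candidate = np.tanh(W_in @ x_t + W_rec @ h_prev + b)
    # Per-neuron binary switch: update where the gate is open, else preserve.
    return np.where(g_t == 1, candidate, h_prev)

# Toy example with H = 3 neurons and a 2-dimensional input.
rng = np.random.default_rng(0)
H, D = 3, 2
W_in = rng.normal(size=(H, D))
W_rec = rng.normal(size=(H, H))
b = np.zeros(H)
h0 = rng.normal(size=H)
x = rng.normal(size=D)

g = np.array([1, 0, 0])  # hypothetical gate: only neuron 0 updates
h1 = selective_update_step(h0, x, g, W_in, W_rec, b)
```

In this toy run, neurons 1 and 2 keep exactly the values they had in `h0`, while neuron 0 takes the non-linear update, which is the "exact, unchanged memory" behavior the abstract refers to.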

Paper Structure (40 sections, 27 equations, 13 figures, 7 tables)

Introduction
Background
Recurrent Neural Network
Backpropagation Through Time
Methods
Selective Update
Gate Scheduling
Selective update shortens effective gradient paths
Assumptions and norms
Proposition 1 (effective path length)
Effective depth scales with update rate
Backpropagation
Selective update creates ensembles of sub-RNNs
Implementation
Experiments
...and 25 more sections

Figures (13)

Figure 1: suRNN architecture and the selective update mechanism: (a) Transition of suRNN from time step $t-1$ to $t$: unlike conventional RNNs that apply a uniform, time-agnostic transition, suRNN adopts a per-neuron, time-dependent binary gate $\mathbf{g}_t \in \{0, 1\}^H$. This allows each neuron to dynamically choose whether to preserve or update its state at each step. (b) Selective-update mechanism for $H = 3$ neurons: the per-neuron gating logic functions as an RC circuit with a switch. When the switch is off ($\mathbf{g}_{t, i} = 0$, bottom path), neuron $\mathbf{h}_{t, i}$ bypasses the update and remains unchanged; when the switch is on ($\mathbf{g}_{t, i} = 1$, top path), neuron $\mathbf{h}_{t, i}$ undergoes a standard non-linear update.
Figure 2: Selective update improves long-range credit assignment. (a) RNN gradient profiles across delays $T$ (color bar; solid $=$ with selective gate, dashed $=$ without). Gating keeps gradients bounded (about $10^{-3}\!-\!10^{-2}$, log scale) with near-parallel decay; increasing $T$ mainly stretches the horizon, while ungated runs collapse to a tiny floor after early spikes. (b) Copying Memory at $T{=}5000$ (mean $\pm$ std over 3 seeds): both GRU and RNN with the gate converge faster and to lower loss, whereas ungated baselines plateau. These trends indicate the gate acts as an identity/skip path with per-step gain near $1$, preserving long-range gradients.
Figure 3: WikiText-103 language modeling learning curves showing training perplexity versus epochs for one-pass suGRU, a Hybrid variant, and a multi-head attention Transformer. The inset reports training loss over the same training run. The dashed horizontal line indicates the baseline Transformer performance.
Figure 4: Spatio-temporal dynamics of a two-layer suGRU on sMNIST. Averaged gating activity (left) and hidden state changes $\Delta \mathbf{h}_t = \mathbf{h}_t - \mathbf{h}_{t-1}$ (right) are reshaped to $28 \times 28$ for spatial analysis. The signed increment maps demonstrate that the model performs sparse, event-triggered updates targeted at salient features in both layers, successfully bypassing redundant pixels.
Figure A1: Selective-update across recurrent families. Top: the update equations for vanilla RNN, GRU, and a spiking neuron model (SNN). Bottom: the same modules rewritten in selective-update form by inserting a per-unit gate $g_i[t]$. In GRU, it composes with the native continuous gate $z_i[t]$, and in SNN it acts on membrane integration. This exposes a unified timing control across architectures while leaving the underlying transforms unchanged.
...and 8 more figures
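
The identity/skip path mentioned in the Figure 2 caption can be made concrete with a short derivation. This is a sketch under stated assumptions, not the paper's Proposition 1: it assumes a vanilla-RNN candidate with pre-activation $\mathbf{a}_t = W \mathbf{x}_t + U \mathbf{h}_{t-1} + \mathbf{b}$ and non-linearity $\phi$ (the symbols $W$, $U$, $\mathbf{b}$, $\mathbf{a}_t$, and $\phi$ are introduced here for illustration), and it treats the binary gate $\mathbf{g}_t$ as constant with respect to $\mathbf{h}_{t-1}$.

```latex
% Selective update with per-neuron binary gate g_t \in \{0,1\}^H
\mathbf{h}_t = \mathbf{g}_t \odot \phi(\mathbf{a}_t)
             + (\mathbf{1} - \mathbf{g}_t) \odot \mathbf{h}_{t-1},
\qquad
\mathbf{a}_t = W \mathbf{x}_t + U \mathbf{h}_{t-1} + \mathbf{b}

% One-step Jacobian, treating g_t as constant w.r.t. h_{t-1}
\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}
  = \operatorname{diag}(\mathbf{1} - \mathbf{g}_t)
  + \operatorname{diag}(\mathbf{g}_t)\,
    \operatorname{diag}\!\big(\phi'(\mathbf{a}_t)\big)\, U

% For a neuron i whose switch is off at step t (g_{t,i} = 0)
\frac{\partial \mathbf{h}_{t,i}}{\partial \mathbf{h}_{t-1,j}} = \delta_{ij}

% If the switch stays off for k consecutive steps
\mathbf{h}_{t,i} = \mathbf{h}_{t-k,i}
\quad\Longrightarrow\quad
\frac{\partial \mathbf{h}_{t,i}}{\partial \mathbf{h}_{t-k,i}} = 1
```

Under these assumptions, every step at which a neuron's gate is closed contributes an exact gain of $1$ along that neuron's gradient path, so only the steps at which the gate is open contribute factors of $\phi'(\mathbf{a}_t)\,U$ to the backpropagated product.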
