Improved state mixing in higher-order and block diagonal linear recurrent networks

Igor Dubinin, Antonio Orvieto, Felix Effenberger

TL;DR

This work targets the expressivity–efficiency gap in linear recurrent networks (LRNNs) caused by diagonal state transitions. It introduces Higher-order Linear Recurrent Units (H-LRU, $m$-th order recurrence) and Block-Diagonal LRUs (BD-LRU) to enable richer time and channel mixing, coupled with a joint L1-normalization of gates to stabilize training. A parallel-scan implementation preserves throughput for moderate block/window sizes, enabling scalable long-sequence processing. Empirically, BD-LRU often matches or surpasses linear state-space models and LSTMs on synthetic tasks and language modeling, while H-LRU excels in parameter-efficient compression. Overall, the structure of state mixing, not width alone, drives expressivity in LRNNs.

Abstract

Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block sizes. A parallel-scan implementation of the proposed architectures keeps the throughput competitive with diagonal LRNNs for moderate orders (H-LRU) and block sizes (BD-LRU). In synthetic sequence modeling tasks, the performance of BD-LRU matches or exceeds that of linear SSMs (Mamba), low-rank LRNNs (DeltaNet) and LSTM baselines, while H-LRU is found to be the most parameter-efficient in the compression task. In both synthetic sequence modeling and language modeling, our results indicate that the structure of state mixing, rather than width alone, shapes the expressivity of LRNNs, offering a practical route to closing the efficiency-expressivity gap in linear sequence models.
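
The per-row L1-normalization described in the abstract can be illustrated with a minimal sketch of a single BD-LRU block update of the form $\mathbf{h}_t = \mathbf{A}_t \mathbf{h}_{t-1} + \mathbf{a}_{0,t} \odot \mathbf{v}_t$. Several details here are assumptions made for illustration and are not taken from the paper: the gate layout (each row holds $m$ state gates followed by one input gate), the joint normalization over those $m+1$ values, and the random gates that stand in for learned, input-dependent ones. The function names are hypothetical.

```python
import random


def l1_normalize_row(raw_row):
    """Scale a row of raw gate values so that its absolute values sum to 1."""
    total = sum(abs(g) for g in raw_row)
    return [g / total for g in raw_row] if total > 0 else list(raw_row)


def bd_lru_block_step(h_prev, v_t, raw_gates):
    """One step of a single m x m block: h_t = A_t h_{t-1} + a0_t * v_t.

    raw_gates[i] holds m + 1 unnormalized gates for row i. The first m become
    row i of A_t and the last becomes the input gate a0_t[i] (assumed layout).
    """
    m = len(h_prev)
    h_t = []
    for i in range(m):
        gates = l1_normalize_row(raw_gates[i])
        state_part = sum(gates[j] * h_prev[j] for j in range(m))
        h_t.append(state_part + gates[m] * v_t[i])
    return h_t


def run_block(inputs, m, seed=0):
    """Run one block over a sequence, with random gates as a stand-in."""
    rng = random.Random(seed)
    h = [0.0] * m
    for v_t in inputs:
        raw = [[rng.uniform(-1.0, 1.0) for _ in range(m + 1)] for _ in range(m)]
        h = bd_lru_block_step(h, v_t, raw)
    return h
```

Under this assumed normalization, each new state entry is a signed combination of previous state entries and the current input with absolute weights summing to one, so the largest state magnitude cannot exceed the largest magnitude seen so far among the inputs and the initial state. This is one way to see why such a normalization keeps the recurrence from blowing up as the block size grows.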

Paper Structure (29 sections, 1 theorem, 21 equations, 8 figures, 5 tables)

Key Result

Proposition 1

Consider either the H-LRU or the BD-LRU architectures, written in matrix form as shown in Equations eq:hlru_canon and eq:bdlru. If for any $k\in[1,N]$, the $k$-th recurrent non-diagonal block $\mathbf{h}_{t}^k = \mathbf{A}^k_t \times \mathbf{h}^k_{t-1} + \mathbf{a}^k_{0,t} \odot \mathbf{v}^k_t$ is s

Figures (8)

  • Figure 1: Structure and performance of the proposed H-LRU and BD-LRU architectures. A. Schematic illustration of the theoretically predicted trade-off between expressivity and efficiency of block-diagonal linear recurrent networks. B. Schematic illustration of the gating mechanisms in block-diagonal form, showing both the state gates that constitute the state-transition matrix and the input gates that act on external inputs. The structure of the gates' selectivity is color-coded: white squares indicate fixed zero gates, black squares indicate fixed identity gates, and other colors indicate active selective gates; similar color palettes indicate row-wise normalization. C. Summary of the performance of the proposed and baseline models. The x-axis indicates the number of FLOPs per recurrent step. The y-axis denotes the mean test accuracy over all considered synthetic tasks (compression, selective copying, in-context recall, permutation) of the overall best-performing model configuration (hidden size up to 6k). Optimal hidden sizes vary between models; see also Fig. \ref{fig:mad_scaling}. Note that H-LRU and BD-LRU can match or exceed the performance of both linear and non-linear baselines while requiring fewer FLOPs per recurrent step. Diagonal LRU shows the best result across H-LRU m1 and BD-LRU m1, which are identical models for $m=1$. D. Best performance for different window sizes $m$ (H-LRU) and block sizes $m$ (BD-LRU).
  • Figure 2: Scaling of performance with window/block size on the compression task for L1 normalization with different parameterizations. Results are shown for different window/block sizes $m$ of the higher-order LRU (H-LRU) and block diagonal LRU (BD-LRU). A. Comparison between H-LRUs. B. Comparison between BD-LRUs.
  • Figure 3: Eigenvalue spectra of the transition matrices learned by BD-LRU on the $S5$ dataset. BD-LRU exhibits negative eigenvalues starting from $m=2$ and complex eigenvalues from $m=3$. Other configurations are reported in Appendix \ref{a:eigen}.
  • Figure 4: Language modeling results on FineWeb. A. Best achieved perplexity for 210M parameter BD-LRU models (trained on 10B tokens) across varying learning rates. B. Performance scaling of BD-LRU models on 2.5B tokens with varying hidden dimensions. Results indicate that moderate block sizes provide a superior inductive bias. C. Runtime and perplexity comparison for 140M parameter models. While H-LRUs are parameter-efficient, matching the parameter budget of a BD-LRU requires increasing the H-LRU hidden dimension by a factor of $m$, making them substantially more costly to scale.
  • Figure 5: Model throughput on the selective copying task. (A) Comparison of sequential, higher-order parallel, and autotuned higher-order parallel implementations of BD-LRUs with 128 blocks and a sequence length of 2048, illustrating the advantage of the parallel scan implementation and the trade-off between expressivity and efficiency. BD-LRU is shown for illustration purposes only; H-LRU employs the same parallel scan implementation. (B) Comparison for layers with a hidden size of 768 and a correspondingly adjusted number of blocks. Note that the trade-off between expressivity and efficiency becomes more pronounced for longer sequences. (C) Throughput comparison of parameter-matched layers ($\sim$33M parameters). The number of blocks is adjusted to ensure consistent model sizes across architectures. BD-LRU achieves throughput competitive with other LRNN baselines. Notably, larger block sizes demonstrate higher practical efficiency despite increased theoretical complexity, due to superior utilization of GPU hardware operations.
  • ...and 3 more figures
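
The parallel-scan implementation mentioned in the Figure 5 caption relies on the fact that steps of a linear recurrence compose associatively. The following is a minimal sketch of that idea in plain Python, not the paper's GPU implementation: each step $\mathbf{h} \mapsto \mathbf{A}_t \mathbf{h} + \mathbf{b}_t$ is stored as a pair $(\mathbf{A}_t, \mathbf{b}_t)$, and a divide-and-conquer prefix scan combines these pairs. All function names are hypothetical, and $\mathbf{b}_t$ stands for the gated input term.

```python
def matmul(a, b):
    """Dense matrix product for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, x):
    """Matrix-vector product."""
    return [sum(a[i][k] * x[k] for k in range(len(x))) for i in range(len(a))]


def combine(first, second):
    """Compose two affine steps h -> A h + b, applying `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return matmul(a2, a1), [p + q for p, q in zip(matvec(a2, b1), b2)]


def inclusive_scan(steps):
    """Divide-and-conquer prefix scan over `combine`.

    The two halves do not depend on each other, so on parallel hardware they
    could be processed concurrently. Here they simply run one after the other.
    """
    if len(steps) == 1:
        return list(steps)
    mid = len(steps) // 2
    left = inclusive_scan(steps[:mid])
    right = inclusive_scan(steps[mid:])
    carry = left[-1]  # composition of every step in the left half
    return left + [combine(carry, r) for r in right]


def sequential_states(steps, h0):
    """Reference: run the recurrence one step at a time."""
    states, h = [], h0
    for a, b in steps:
        h = [p + q for p, q in zip(matvec(a, h), b)]
        states.append(h)
    return states


def scanned_states(steps, h0):
    """Recover every state from the scanned (A, b) prefixes."""
    return [[p + q for p, q in zip(matvec(a, h0), b)]
            for a, b in inclusive_scan(steps)]
```

Each `combine` call involves an $m \times m$ matrix product, whereas a diagonal recurrence only needs elementwise products. This is consistent with the expressivity-efficiency trade-off the captions describe for growing block sizes.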

Theorems & Definitions (1)

  • Proposition 1
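
Proposition 1 refers to both architectures written in matrix form. For H-LRU, one plausible reading is the standard companion-matrix construction, in which an $m$-th order scalar recurrence becomes a first-order recurrence on the stacked state $(h_{t-1}, \dots, h_{t-m})$. The sketch below assumes that construction; the paper's exact layout may differ. The gate vector `a_t = [a_0, a_1, ..., a_m]` and both function names are hypothetical.

```python
def hlru_direct(inputs, gates, m):
    """h_t = sum_{i=1..m} a_{i,t} h_{t-i} + a_{0,t} v_t for one scalar channel."""
    history = [0.0] * m  # history[i - 1] holds h_{t-i}
    outputs = []
    for v_t, a_t in zip(inputs, gates):
        h_t = a_t[0] * v_t + sum(a_t[i] * history[i - 1] for i in range(1, m + 1))
        history = [h_t] + history[:-1]  # shift the window of past states
        outputs.append(h_t)
    return outputs


def hlru_companion(inputs, gates, m):
    """Same recurrence as a first-order update of the stacked state."""
    state = [0.0] * m  # (h_{t-1}, ..., h_{t-m})
    outputs = []
    for v_t, a_t in zip(inputs, gates):
        # Companion matrix: a gated first row, then a shifted identity
        # whose entries are fixed to 0 or 1.
        A = [list(a_t[1:])] + [
            [1.0 if j == i else 0.0 for j in range(m)] for i in range(m - 1)
        ]
        new_state = [sum(A[r][c] * state[c] for c in range(m)) for r in range(m)]
        new_state[0] += a_t[0] * v_t  # only the first row receives input
        state = new_state
        outputs.append(state[0])
    return outputs
```

In this form only the first row of each transition matrix carries active gates, while the remaining rows are fixed zeros and ones. This matches the fixed zero and fixed identity gates that the Figure 1 caption describes for the block-diagonal view.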