DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, Riccardo Grazzi

TL;DR

DeltaProduct introduces a tunable, stable recurrence for linear RNNs by composing $n_h$ generalized Householder transformations per token, so that $n_h$ trades expressivity against efficiency. Theoretical analysis shows that it can solve group word problems and recognize regular languages, and experiments on state-tracking tasks and language modeling show practical benefits, including strong length extrapolation. Empirically, DeltaProduct outperforms DeltaNet on both state-tracking and language modeling benchmarks, with further gains from gating and from increasing $n_h$. The work provides a scalable pathway between diagonal and dense state transitions and offers public code to reproduce the results.

Abstract

Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. Diagonal matrices, used in models such as Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To address this, recent architectures such as DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, which allows simultaneous token and channel mixing, improving associative recall and, as recently shown, state-tracking when allowing state-transition matrices to have negative eigenvalues. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency. We provide a detailed theoretical characterization of the state-tracking capability of DeltaProduct in finite precision, showing how it improves by increasing $n_h$. Our extensive experiments demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling, while also showing significantly improved length extrapolation capabilities.
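The "multiple gradient steps per token" view can be made concrete with a small sketch. It assumes a state $H$ of shape $(d_k, d_v)$, the associative recall loss $\tfrac{1}{2}\lVert H^\top k - v \rVert^2$, and a step size $\beta$; the function names (`recall`, `delta_step`, `delta_product_update`) and the shape convention are illustrative choices, not taken from the paper or its released code. Under these assumptions, one step is algebraically $H \leftarrow (I - \beta k k^\top) H + \beta k v^\top$, and $n_h$ steps compose $n_h$ such factors.

```python
def recall(H, k):
    """Read-out H^T k for a state H of shape (d_k, d_v)."""
    return [sum(k[r] * H[r][c] for r in range(len(H))) for c in range(len(H[0]))]


def delta_step(H, k, v, beta):
    """One gradient step on 0.5 * ||H^T k - v||^2 with step size beta.

    Algebraically: H <- (I - beta * k k^T) H + beta * k v^T.
    """
    err = [r_c - v_c for r_c, v_c in zip(recall(H, k), v)]
    return [[H[r][c] - beta * k[r] * err[c] for c in range(len(err))]
            for r in range(len(H))]


def delta_product_update(H, keys, values, betas):
    """Apply n_h = len(keys) delta steps for a single token (assumed convention)."""
    for k, v, beta in zip(keys, values, betas):
        H = delta_step(H, k, v, beta)
    return H


# One token, n_h = 2, orthonormal keys: both associations are stored at once.
H0 = [[0.0, 0.0], [0.0, 0.0]]
H1 = delta_product_update(H0, keys=[[1.0, 0.0], [0.0, 1.0]],
                          values=[[1.0, 2.0], [3.0, 4.0]], betas=[1.0, 1.0])
print(H1)                      # [[1.0, 2.0], [3.0, 4.0]]
print(recall(H1, [1.0, 0.0]))  # [1.0, 2.0]
```

In this sketch, setting `beta=2.0` with a zero value vector turns a single step into a pure reflection $(I - 2 k k^\top) H$, which is where the negative eigenvalue mentioned in the abstract comes from.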

Paper Structure

This paper contains 29 sections, 13 theorems, 41 equations, 22 figures, and 6 tables.

Key Result

Theorem 1

For any $n \in \mathbb{N}$ there exists a DeltaProduct model with one of the following configurations that can solve the word problem of the symmetric group $S_n$: (i) one layer with $n_h = n{-}1$ (Grazzi et al., 2025); (ii) three layers with $n_h{>}1$; (iii) four layers with $n_h = 1$. The construction for (ii) and
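One way to see why $n_h = n{-}1$ is a natural budget in configuration (i): every permutation of $n$ elements is a product of at most $n{-}1$ transpositions, and each transposition matrix is a Householder reflection $I - 2uu^\top/(u^\top u)$ with $u = e_i - e_j$. The sketch below checks this numerically for $S_4$. It is an illustration of that standard fact only, not the paper's construction, and all function names are hypothetical.

```python
from itertools import permutations


def householder(u):
    """Reflection I - 2 u u^T / (u^T u) across the hyperplane orthogonal to u."""
    n, uu = len(u), sum(x * x for x in u)
    return [[(1.0 if r == c else 0.0) - 2.0 * u[r] * u[c] / uu for c in range(n)]
            for r in range(n)]


def matmul(A, B):
    return [[sum(A[r][m] * B[m][c] for m in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]


def transpositions(p):
    """Swaps (i, j) that sort p to the identity; at most len(p) - 1 of them."""
    cur, swaps = list(p), []
    for i in range(len(cur)):
        if cur[i] != i:
            j = cur.index(i)
            cur[i], cur[j] = cur[j], cur[i]
            swaps.append((i, j))
    return swaps


def permutation_matrix(p):
    """Matrix P with P e_i = e_{p[i]}."""
    n = len(p)
    return [[1.0 if p[c] == r else 0.0 for c in range(n)] for r in range(n)]


def as_householder_product(p):
    """Multiply one reflection per transposition; later swaps act on the left."""
    n = len(p)
    M = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i, j in transpositions(p):
        u = [0.0] * n
        u[i], u[j] = 1.0, -1.0
        M = matmul(householder(u), M)
    return M


# Every element of S_4 is reproduced with at most 3 reflections.
ok = all(len(transpositions(p)) <= 3
         and as_householder_product(p) == permutation_matrix(p)
         for p in permutations(range(4)))
print(ok)  # True
```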

Figures (22)

  • Figure 1: (Left) DeltaProduct$_{n_h}$ learns higher-order permutation groups like $S_5$ in one layer, while DeltaNet ($n_h{=}1$) is limited to $S_2$ (parity). (Right) Length extrapolation of DeltaProduct improves significantly with higher $n_h$.
  • Figure 2: Overview of state-transition matrices ${\bm{A}}({\bm{x}}_i)$ in linear RNNs.
  • Figure 3: Two reflections produce a 2D rotation: Reflecting $x$ across planes $H_0$ and $H_1$ (with normals ${\bm{k}}_0$ and ${\bm{k}}_1$) yields a rotation by $2\theta$, where $\theta$ is the angle between the planes.
  • Figure 4: Training throughput of parameter matched 1.3B DeltaProduct$_{n_h}$ on a H100. Matched via: (Top) scaling the number of heads, (Bottom) scaling the head dimension.
  • Figure 5: Accuracy on state-tracking tasks for permutation groups $S_3$, $S_4$, $A_5$, and $S_5$, plotted against sequence length (x-axis). (Top row) Varying the number of Householder products $n_h$ for a single layer DeltaProduct$_{n_h}[-1,1]$. (Bottom row) Varying the number of layers $l$ of DeltaProduct$_{1}[-1,1]$/DeltaNet$[-1,1]$ (single Householder). Dashed vertical line at training context length 128. Higher $n_h$ improves extrapolation to longer sequences of permutations, e.g., $S_3$ can be learned with $n_h=2$ with a single layer while three layers are required when keeping $n_h=1$.
  • ...and 17 more figures
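
The geometric claim in the Figure 3 caption can be verified numerically. The sketch below, with hypothetical helper names and arbitrarily chosen angles, builds two 2D reflections from unit normals that are $\theta$ apart and checks that their product equals a rotation by $2\theta$.

```python
import math


def reflection(phi):
    """2D Householder reflection I - 2 k k^T with unit normal k = (cos phi, sin phi)."""
    k = (math.cos(phi), math.sin(phi))
    return [[(1.0 if r == c else 0.0) - 2.0 * k[r] * k[c] for c in range(2)]
            for r in range(2)]


def rotation(angle):
    """Counter-clockwise 2D rotation matrix."""
    return [[math.cos(angle), -math.sin(angle)],
            [math.sin(angle), math.cos(angle)]]


def matmul2(A, B):
    return [[sum(A[r][m] * B[m][c] for m in range(2)) for c in range(2)]
            for r in range(2)]


theta = math.pi / 6                   # angle between the two normals
phi0 = 0.3                            # arbitrary orientation of the first normal
product = matmul2(reflection(phi0 + theta), reflection(phi0))
target = rotation(2 * theta)

matches = all(math.isclose(product[r][c], target[r][c], abs_tol=1e-12)
              for r in range(2) for c in range(2))
print(matches)  # True
```

The result does not depend on `phi0`: only the angle between the two normals determines the rotation.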

Theorems & Definitions (28)

  • Theorem 1
  • Theorem 2
  • Remark 1
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Theorem 3: Restatement of the theorem labeled th:groups
  • proof
  • Lemma 2
  • ...and 18 more