Fixed-Point RNNs: Interpolating from Diagonal to Dense

Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto

TL;DR

This work introduces Fixed-Point RNNs (FP-RNNs) to bridge diagonal and dense linear RNNs by formulating dense transitions as fixed points of diagonal recurrences. The core idea is to parameterize a diagonal RNN with a contractive sequence mixer $\mathbf{\Lambda}_t$ and a channel mixer $\mathbf{Q}_t$, then solve for the fixed point $\mathbf{h}^*$ (and, for matrix states, $\mathbf{H}_t^*$) through fixed-point iterations, trading parallelism for expressivity as needed. The FP-RNN framework is instantiated in FP-Mamba, which extends to matrix states with structured, input-dependent mixers (Diagonal+Low Rank, Householder, Kronecker) and a shifted hidden-state dependence to improve copying and state tracking. Training uses implicit differentiation to avoid storing the full sequence of fixed-point iterations, and theoretical results guarantee stable convergence under contractive conditions. Empirically, FP-Mamba achieves strong state-tracking results on $A_5$ and $S_5$ and competitive copying performance, with adaptive iteration counts enabling generalization to longer sequences at a fixed parameter budget. The framework offers a scalable path to highly expressive sequence mixers that interpolate between fast diagonal processing and dense, memory-rich dynamics suitable for long-range dependencies.

Abstract

Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed-points of parallelizable diagonal linear RNNs. The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters and achieve state-of-the-art results on the state-tracking benchmarks $A_5$ and $S_5$, while matching performance on copying and other tasks.

Paper Structure

This paper contains 55 sections, 5 theorems, 39 equations, 16 figures, 4 tables.

Key Result

Theorem 3.1

Let $f_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{h})$ be the diagonal linear RNN with input-independent $\mathbf{\Lambda}$ and $\mathbf{Q}$. If $||\mathbf{\Lambda}||_2 < 1$ and $||\mathbf{I}-\mathbf{Q}||_2 < 1$, then $f_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{h})$ has a Lipschitz constant $<1$ in $\mathbf{h}$. The proof is given in the appendix.
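
The role of the two conditions can be illustrated with a minimal NumPy sketch. This summary does not state the exact definition of $f_{\boldsymbol{\theta}}$, so the sketch assumes one plausible form, $\mathbf{h}_t^{\ell} = \mathbf{\Lambda}\mathbf{h}_{t-1}^{\ell} + (\mathbf{I}-\mathbf{Q})\mathbf{h}_t^{\ell-1} + \mathbf{u}_t$, with $\mathbf{u}_t$ standing in for the projected input. The function names, toy dimensions, and random parameters are all hypothetical and not taken from the paper.

```python
import numpy as np


def diagonal_rnn_sweep(lam, Q, u, h_prev_iter):
    """One sweep of the ASSUMED diagonal recurrence over the whole sequence.

    h_t = lam * h_{t-1} + (I - Q) @ h_prev_iter[t] + u[t]
    `lam` is the diagonal of Lambda, stored as a vector.
    """
    T, d = u.shape
    eye = np.eye(d)
    h = np.zeros((T, d))
    state = np.zeros(d)
    for t in range(T):
        state = lam * state + (eye - Q) @ h_prev_iter[t] + u[t]
        h[t] = state
    return h


def fixed_point_rnn(lam, Q, u, max_iters=500, tol=1e-10):
    """Iterate the sweep until successive iterates agree to within `tol`."""
    h = np.zeros_like(u)
    for it in range(1, max_iters + 1):
        h_new = diagonal_rnn_sweep(lam, Q, u, h)
        if np.max(np.abs(h_new - h)) < tol:
            return h_new, it
        h = h_new
    return h, max_iters


def dense_rnn(lam, Q, u):
    """Dense recurrence implied by the fixed point of the assumed form:
    Q h_t = lam * h_{t-1} + u_t, i.e. h_t = Q^{-1} (lam * h_{t-1} + u_t).
    """
    T, d = u.shape
    Q_inv = np.linalg.inv(Q)
    h = np.zeros((T, d))
    state = np.zeros(d)
    for t in range(T):
        state = Q_inv @ (lam * state + u[t])
        h[t] = state
    return h


# Toy problem (hypothetical sizes and random parameters).
rng = np.random.default_rng(0)
d, T = 4, 8
lam = rng.uniform(0.1, 0.9, size=d)             # ||Lambda||_2 < 1
M = rng.standard_normal((d, d))
Q = np.eye(d) + 0.5 * M / np.linalg.norm(M, 2)  # ||I - Q||_2 = 0.5 < 1
u = rng.standard_normal((T, d))

h_fp, n_iters = fixed_point_rnn(lam, Q, u)
h_dense = dense_rnn(lam, Q, u)
print("iterations used:", n_iters)
print("max abs difference to dense RNN:", np.max(np.abs(h_fp - h_dense)))
```

Under this assumed form, the fixed point satisfies $\mathbf{Q}\mathbf{h}_t = \mathbf{\Lambda}\mathbf{h}_{t-1} + \mathbf{u}_t$, so the diagonal sweeps converge to the hidden states of a dense recurrence with transition $\mathbf{Q}^{-1}\mathbf{\Lambda}$. Capping `max_iters` at a small value would stop the iteration early, which is one way to picture the interpolation between a diagonal and a dense model.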

Figures (16)

  • Figure 1: Sequence length generalization at training length 16 (pink) for state-tracking on $A_5$, with Transformer (brown) and LSTM (purple) as lower/upper bounds. Our Fixed-Point RNN (FP-Mamba-H) is trained with different maximum numbers of fixed-point iterations $\ell_{\text{max}}$: between $2$ (green) and $16$ (blue). Increasing the number of fixed-point iterations allows the linear RNN to interpolate from diagonal to dense in a few iterations.
  • Figure 1: Effect of shifted hidden state dependence $\mathbf{y}_{t-1}^{\ell-1}$ on copying at $\times2$ length generalization. Each column indicates which input-dependent component of the FP-Mamba iteration recurrence also depends on $\mathbf{y}_{t-1}^{\ell-1}$. Performance is unlocked by including a hidden dependence for $\mathbf{b}_t$ and $\mathbf{c}_t$.
  • Figure 2: (a) State-tracking on $A_5$ at sequence length $16$, and (b) character accuracy of copying at $2\times$ sequence length generalization, trained on lengths $\in [5, 50]$. Our single layer FP-Mamba-H with mixer reflections $r\in \{1,2,4\}$ is compared to baselines of increasing depth $\in \{1,2,4,6,8\}$. FP-Mamba-H is the only model capable of solving both the state-tracking and the copy task.
  • Figure 3: (a) An overview of the proposed Fixed-Point RNN framework. A diagonal RNN $f_{\boldsymbol{\theta}}$ consisting of a sequence mixer $\mathbf{\Lambda}_t$ and a channel mixer $\mathbf{Q}_t$ is iterated until convergence towards the hidden states of an implicitly dense RNN $F_{\boldsymbol{\theta}}$. (b) FP-RNN variants with different channel mixer parametrizations solve the state-tracking task $A_5$ up to various sequence lengths. (c) FP-RNNs adapt their computation time to the difficulty of the task by varying the number of fixed-point iterations $\ell^*$.
  • Figure 4: Length generalization on $A_5$ (a, c) and $S_5$ (b, d) beyond the training sequence length $16$ (pink line). We compare a 1-layer FP-Mamba with mixer variants $\mathbf{Q}_t$ to baselines with 2 layers.
  • ...and 11 more figures

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem B.1
  • Theorem B.1
  • Definition F.1
  • Proposition F.2
  • proof
  • Remark F.3
  • Remark F.4