Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues

Antonio Orvieto; Soham De; Caglar Gulcehre; Razvan Pascanu; Samuel L. Smith

TL;DR

This work analyzes the expressivity of architectures that compose linear diagonal RNNs with time-invariant, position-wise MLPs for sequence-to-sequence tasks. It proves a finite-width universality result: with width $N\ge \dim(\mathcal{V})$ for the RNN and width $D \gtrsim \mathcal{O}(L/\epsilon^2)$ for the MLP, the model can approximate any sufficiently regular causal map to within error $\epsilon$, by first losslessly encoding the input in the RNN and then non-linearly processing it with the MLP. A key technical contribution is using Barron-function theory to bound the MLP width needed to interpolate a sequence of time-indexed functions, together with a Vandermonde-based argument for input memorization via complex eigenvalues. The paper also discusses the practical benefits of complex eigenvalues near the unit circle for conditioning and memory, draws connections to HiPPO, and validates the theory with experiments on reconstruction and ODE-driven sequences. Overall, the results justify the design of modern SSM-based models and provide principled guidance on initialization and the role of complex-valued recurrences for long-range sequence modeling.

Abstract

Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures' effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection with specific choices that are crucial in practice, such as carefully designed initialization distributions and the potential use of complex numbers. In this paper, we show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near the unit disk (empirically the most successful strategy in S4) greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.
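The lossless-encoding half of this separation of concerns can be illustrated numerically. The following is a minimal sketch rather than the paper's implementation. It assumes scalar inputs, an input projection fixed to all ones, a zero initial state, and eigenvalues placed at evenly spaced phases on a circle of radius 0.99 (a placement chosen here only to make the example deterministic). The function names `run_diag_rnn` and `decode_inputs` are hypothetical. The sketch runs the diagonal linear recurrence and then recovers the full input sequence from the final hidden state by solving the associated Vandermonde system.

```python
import numpy as np

def run_diag_rnn(lam, u):
    """Final state of x_k = lam * x_{k-1} + u_k (elementwise), starting from x_0 = 0."""
    x = np.zeros_like(lam)          # complex state, one entry per eigenvalue
    for u_k in u:
        x = lam * x + u_k           # assumption: input projection B is all ones
    return x

def decode_inputs(lam, x_final, L):
    """Recover the L inputs from the final state by solving V u = x_final."""
    # V[i, j] = lam[i] ** (L - 1 - j), so that x_final = V @ u
    V = np.vander(lam, N=L, increasing=False)
    u_hat, *_ = np.linalg.lstsq(V, x_final, rcond=None)
    return u_hat.real               # inputs are real; drop round-off imaginary parts

L = 16                              # sequence length
N = L                               # hidden width; N >= L with distinct eigenvalues
lam = 0.99 * np.exp(2j * np.pi * np.arange(N) / N)   # assumed eigenvalue placement

rng = np.random.default_rng(0)
u = rng.standard_normal(L)          # arbitrary real input sequence

x_L = run_diag_rnn(lam, u)
u_hat = decode_inputs(lam, x_L, L)
max_err = float(np.max(np.abs(u_hat - u)))
print(max_err)                      # reconstruction error, at round-off level
```

In this setup the decoder is purely linear; in the architecture discussed here, the non-linear processing of the encoded sequence is left to the position-wise MLP.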

Paper Structure (43 sections, 14 theorems, 43 equations, 16 figures)

Introduction
Results on SSMs expressivity.
Contributions.
Preliminaries
Diagonal Linear RNNs.
MLPs and universality.
Universality Result
Linear RNNs can perfectly memorize inputs
Main idea
Multidimensional setting.
RNNs can compress inputs, when possible
General definition of $\boldsymbol{\Omega_k}$.
$\boldsymbol{\Psi}$ may be unknown.
Comparison with HiPPO
MLP Reconstruction
...and 28 more sections

Key Result

Theorem 1

Consider $g(x)$ parametrized by a 1HL-MLP (with $D$ hidden neurons): $g(x) = \sum_{k=1}^D \tilde{c}_k\sigma(\langle \tilde{a}_k , x\rangle + \tilde{b}_k)+ \tilde{c}_0$, where $\sigma$ is any sigmoidal function, i.e., $\lim_{x\to-\infty} \sigma(x) = 0$ and $\lim_{x\to\infty} \sigma(x) = 1$. Let $f:\mathbb{R
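As a reading aid for the parametrization above, here is a minimal sketch of the forward pass of a one-hidden-layer MLP with $D$ hidden neurons. It assumes the logistic function as one particular sigmoidal $\sigma$. The function names and the numerical weights are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: tends to 0 as z -> -inf and to 1 as z -> +inf."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                 # numerically stable branch for negative z
    return e / (1.0 + e)

def one_hidden_layer_mlp(x, a, b, c, c0):
    """g(x) = sum_k c[k] * sigmoid(<a[k], x> + b[k]) + c0."""
    out = c0
    for a_k, b_k, c_k in zip(a, b, c):
        pre = sum(a_ki * x_i for a_ki, x_i in zip(a_k, x)) + b_k
        out += c_k * sigmoid(pre)
    return out

# D = 2 hidden neurons on a 3-dimensional input (arbitrary illustrative weights)
a = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]
b = [0.0, 0.0]
c = [1.0, -1.0]
c0 = 0.5

g0 = one_hidden_layer_mlp([0.0, 0.0, 0.0], a, b, c, c0)
print(g0)
```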

Figures (16)

Figure 1: Illustration of a Linear RNN + position-wise MLP on flattened MNIST digits (LeCun et al., 1998). In our construction, the role of the linear RNN is to compress (if possible) and store the input sequence into the hidden state: from hidden states one can recover past tokens using a linear transformation (see the section on Vandermonde-based memorization). As the hidden state size $N$ increases, the reconstructions become more and more faithful. The MLP (same for all tokens) takes this representation as input, and is able to reproduce the action of any sufficiently regular sequence-to-sequence model (see the MLP section). We provide additional insights and a thorough experimental evaluation in the discussion section.
Figure 2: Effect of eigenvalue magnitude on conditioning. $L = 128$, $N = 2L =256$, $\lambda_i\sim \mathbb{T}[r_{\min}, r_{\max}]$.
Figure 3: Reconstruction of MNIST digits from the final linear RNN hidden state using the Vandermonde inverse. For $r_{\min}=0$, the Vandermonde matrix is ill-conditioned (Fig. 2) and hence only the recent past can be reconstructed. For $r_{\min}=0.99$, we can reconstruct the whole image. See Fig. 4 for results on learned reconstructions.
Figure 4: Reconstruction of MNIST digits and PathFinder data (Tay et al., 2020) using a trained RNN + MLP or linear reconstruction decoder from the last hidden state $x_L$. Plotted is the average pixel-wise L2 norm (mean of 3 runs). All parameters are trained and hyperparameters are tuned. $r_{\min} = 0.9$, $r_{\max}=0.999$ are found to be best for initialization of the RNN (in line with Gu et al., 2021). For large hidden dimension, linear reconstruction is successful. For smaller hidden dimension, non-linear reconstruction becomes necessary.
Figure 5: Trained linear RNN + MLP architecture (1 layer) learns a non-linear sequence-to-sequence map. Additional experiments can be found in the appendix.
...and 11 more figures
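
The dependence of Vandermonde conditioning on eigenvalue magnitude, which the captions of Figures 2 and 3 refer to, can be reproduced in a simplified deterministic setting. The sketch below is an assumption-laden illustration, not the paper's experiment. Instead of sampling eigenvalues from a ring, it places $N = L$ eigenvalues at evenly spaced phases on a circle of radius $r$. In that special case the Vandermonde matrix is a scaled DFT matrix times a diagonal matrix, so its condition number is $r^{-(L-1)}$. The helper name `vandermonde_cond` is hypothetical.

```python
import numpy as np

def vandermonde_cond(r, L):
    """2-norm condition number of V[i, j] = lam[i] ** (L - 1 - j), with N = L."""
    # Assumption: evenly spaced phases on a circle of radius r (not random sampling)
    lam = r * np.exp(2j * np.pi * np.arange(L) / L)
    V = np.vander(lam, N=L, increasing=False)
    return float(np.linalg.cond(V))

L = 64
cond_near = vandermonde_cond(0.99, L)   # eigenvalues close to the unit circle
cond_far = vandermonde_cond(0.80, L)    # eigenvalues further inside the disk

# Closed form for this special placement: r ** -(L - 1)
closed_near = 0.99 ** -(L - 1)
closed_far = 0.80 ** -(L - 1)

print(cond_near, closed_near)
print(cond_far, closed_far)
```

Even in this idealized placement, shrinking the radius from 0.99 to 0.80 inflates the condition number by several orders of magnitude at $L = 64$, which is consistent with the observation that only the recent past can be reconstructed when eigenvalues lie well inside the unit disk.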

Theorems & Definitions (20)

Definition 1: Sequence-to-sequence
Definition 2: Linear RNN
Definition 3: Barron function
Theorem 1: Universality of 1HL-MLPs
Remark 1: Multidimensional Output
Theorem 2: Universality
Remark 2
Theorem 3: Power of Linear RNNs, informal
Proposition 1: Bijectivity
Proposition 2: MLP, single timestamp
...and 10 more
