Design Principles for Sequence Models via Coefficient Dynamics

Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso

TL;DR

This work reframes sequence modeling by treating the outputs as linear combinations of past values whose coefficients arise from autonomous linear dynamical systems driven by impulse inputs. By unifying softmax attention, linear attention, and SSM/RNN-style dynamics under a single coefficient-dynamics framework, it derives six principled design guidelines—covering the readout map, evolution matrices, scaling, and normalization—that connect architectural choices to expressivity, efficiency, input selectivity, and training stability. The authors establish theoretical links between these components (e.g., the necessity of linear readouts for finite-memory recurrence, the role of $A_t$ in injecting positional information, the variance control via $b_j$, and the stabilizing effect of $\eta_i$), and validate the principles through extensive empirical tests on the MAD benchmark. Collectively, the framework enables principled, systematic design of new sequence-model architectures with predictable tradeoffs, beyond benchmark-driven experimentation.

Abstract

Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties, thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
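The coefficient view described in the abstract can be illustrated with a small NumPy sketch. The exact equations are not reproduced in this summary, so the form used here is an assumption: each coefficient is taken as $\alpha_{i,j} = \eta_i\,\phi(q_i^\top h_{i,j})$, where the state $h_{\cdot,j}$ is started by an impulse $b_j k_j$ at step $j$ and then evolves autonomously under $A_t$. The function name `coefficients` and all shapes are hypothetical. Under these assumptions, choosing $A_t = I$, $b_j = 1/\sqrt{n}$, $\phi = \exp$, and $\eta_i$ as the inverse row sum recovers causal softmax attention.

```python
import numpy as np

def coefficients(q, k, A, b, phi):
    """Unnormalized alpha[i, j] = phi(q_i . h_{i,j}) under the ASSUMED dynamics:
    h_{j,j} = b_j * k_j (impulse at step j), h_{t,j} = A_t @ h_{t-1,j} for t > j."""
    T = q.shape[0]
    alpha = np.zeros((T, T))          # entries with j > i stay zero (causal)
    for j in range(T):
        h = b[j] * k[j]               # impulse input sets the initial state
        for i in range(j, T):
            if i > j:
                h = A[i] @ h          # no further input: autonomous evolution
            alpha[i, j] = phi(q[i] @ h)
    return alpha

rng = np.random.default_rng(0)
T, n, d_v = 5, 4, 3
q = rng.normal(size=(T, n))
k = rng.normal(size=(T, n))
v = rng.normal(size=(T, d_v))

A = np.stack([np.eye(n)] * T)         # identity evolution: no decay, no rotation
b = np.full(T, 1.0 / np.sqrt(n))      # scaling of the impulse
alpha = coefficients(q, k, A, b, np.exp)
eta = 1.0 / alpha.sum(axis=1, keepdims=True)   # row-wise normalization
y = (eta * alpha) @ v                 # output = linear combination of values

# Reference: standard causal softmax attention.
scores = q @ k.T / np.sqrt(n)
scores[np.triu_indices(T, 1)] = -np.inf
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
y_ref = w @ v
```

In this sketch the two outputs `y` and `y_ref` agree up to floating-point error, which is one way to read the claim that softmax attention fits the same coefficient-dynamics template as recurrent models.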

Paper Structure

This paper contains 53 sections, 26 theorems, 73 equations, 13 figures, 12 tables.

Key Result

Lemma 1

A recurrent formulation of Eq. \ref{eqn:dynamics} with finite memory (state) in $\mathbb{R}^{n \times d_v}$, which allows simultaneous computation of $\alpha_{i,j}$, exists if and only if $\phi(\cdot): \mathbb{R} \to \mathbb{R}$ is a linear map.
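The "if" direction of Lemma 1 can be checked numerically with a minimal sketch. The referenced equation is not shown in this summary, so the dynamics below are an assumption: $\alpha_{i,j} = \phi(q_i^\top A_i \cdots A_{j+1}\, b_j k_j)$ with $\phi$ the identity, diagonal $A_t$, and $\eta_i = 1$. The function names `direct` and `recurrent` are hypothetical. With a linear readout, all coefficients can be folded into a single state $S_i \in \mathbb{R}^{n \times d_v}$ that is updated once per step.

```python
import numpy as np

def direct(q, k, v, A, b):
    """O(T^2) evaluation: build every coefficient alpha[i, j] explicitly."""
    T, d_v = v.shape
    y = np.zeros((T, d_v))
    for i in range(T):
        for j in range(i + 1):
            h = b[j] * k[j]                 # impulse at step j
            for t in range(j + 1, i + 1):
                h = A[t] * h                # diagonal A_t stored as a vector
            y[i] += (q[i] @ h) * v[j]       # phi = identity (linear readout)
    return y

def recurrent(q, k, v, A, b):
    """O(T) evaluation with a single finite state S of shape (n, d_v)."""
    T, d_v = v.shape
    n = q.shape[1]
    S = np.zeros((n, d_v))
    y = np.zeros((T, d_v))
    for i in range(T):
        S = A[i][:, None] * S + b[i] * np.outer(k[i], v[i])
        y[i] = q[i] @ S                     # reads all alpha[i, :] at once
    return y

rng = np.random.default_rng(0)
T, n, d_v = 6, 4, 3
q = rng.normal(size=(T, n))
k = rng.normal(size=(T, n))
v = rng.normal(size=(T, d_v))
A = rng.uniform(0.5, 1.0, size=(T, n))      # assumed diagonal, entries in (0, 1]
b = np.ones(T)

y_direct = direct(q, k, v, A, b)
y_rec = recurrent(q, k, v, A, b)
```

The two results match because a linear $\phi$ commutes with the sum over $j$. With a nonlinear $\phi$ such as $\exp$, the readout can no longer be pulled outside the sum, so this particular state update stops working; the sketch does not prove the "only if" direction.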

Figures (13)

  • Figure 1: (Principle 2) Performance of different readout maps $\phi(\cdot)$ on two MAD tasks against the fraction of coefficients with near zero values ($\lvert\alpha\rvert \leq 0.001$; left & middle) and the theoretical near-zero sets of each readout map ($\lvert\phi(x)\rvert \leq 0.001$; right). The other parameters $A_t$, $b_j$, $\eta_i$ are fixed, thus the setting for $^\dag$ is equivalent to softmax attention.
  • Figure 2: (Principle 3) Performance of two $A_t$ choices with and without positional embeddings (PE) on the noisy in-context recall task.
  • Figure 3: (Principle 4) Performance of four $A_t$ choices (scalar, diagonal, Householder with $k_t$, Householder with learned $z_t$) on two MAD tasks, with all other parameters fixed. The scalar/diagonal parameter(s) $\lambda_t$ are using either the GLA (gray) or Mamba-2 (magenta) parameterization and the Householder scaling $\beta_t$ is either fixed (gray) or learned (magenta).
  • Figure 4: Visual representation of output spaces referenced in Table \ref{tab:linear_combinations}.
  • Figure 5: (Principle 1) Computation time of a recurrent (magenta) and non-recurrent (gray) implementation against sequence length.
  • ...and 8 more figures

Theorems & Definitions (59)

  • Remark 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Example 1: Standard nonlinear readout maps
  • Corollary 2.1
  • proof
  • Corollary 2.2
  • proof
  • ...and 49 more