Task-Level Insights from Eigenvalues across Sequence Models

Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger

TL;DR

This work tackles the scalability bottleneck of softmax attention by analyzing eigenvalue spectra within a unified dynamical-systems framework (DSF) to compare attention-based models and linear state-space models (SSMs). By mapping masked attention and linear alternatives to a discrete-time linear parameter-varying (LPV) dynamical system, it links eigenvalue placement to memory and long-range dependency, revealing task-driven spectral signatures such as clusters near $1$ for long-memory tasks and near $0$ for selective memory. The empirical study across multiple benchmarks shows how architectural choices (gating, convolution, layer depth, and normalization) reshape the eigenvalue spectra and correspondingly affect performance; Mamba-2 sits between pure SSMs and attention, balancing memory and selectivity. The findings establish eigenvalue analysis as a principled, task-aware metric to guide initialization and architectural design, potentially informing spectral-aware training and model selection for long-context sequence modeling.
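
To make the link between eigenvalue placement and memory concrete, the following is a minimal sketch, not the paper's method. It assumes a scalar, time-invariant recurrence $h_t = \lambda h_{t-1} + x_t$, whereas the systems discussed here are parameter-varying, so $\lambda$ would change with the input. Under that simplifying assumption, an input seen $k$ steps ago is weighted by $\lambda^k$. The helper name `memory_horizon` and the tolerance `tol` are hypothetical choices made for illustration.

```python
def memory_horizon(lam, tol=0.01, max_steps=100_000):
    """Return the smallest k with |lam|**k < tol, or None if never reached.

    Assumes the scalar recurrence h_t = lam * h_{t-1} + x_t, so the weight
    of an input seen k steps ago is lam**k. `tol` is an illustrative cutoff
    for "effectively forgotten", not a value taken from the paper.
    """
    magnitude = abs(lam)
    weight = 1.0  # weight of the current input (k = 0)
    for k in range(max_steps + 1):
        if weight < tol:
            return k
        weight *= magnitude
    return None  # |lam| >= 1 (or extremely close): the input is not forgotten


if __name__ == "__main__":
    for lam in (0.0, 0.5, 0.99, 1.0):
        print(f"lambda = {lam}: horizon = {memory_horizon(lam)}")
```

In this toy setting, an eigenvalue near $1$ keeps past inputs influential over hundreds of steps, while an eigenvalue near $0$ discards them almost immediately, which matches the long-memory and selective-memory signatures described above.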

Abstract

Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.

Paper Structure

This paper contains 34 sections, 5 equations, 26 figures, 2 tables.

Figures (26)

  • Figure 1: Eigenvalue distributions for one head, across models, selected layers, and tasks. Bars show the percentage of eigenvalues within discretized ranges (chosen to emphasize eigenvalues near zero and near one). Light and dark bars indicate the distribution at initialization and after training, respectively. Error bars denote standard deviation across input sequences. Model performance, measured as perplexity for WikiText (lower is better) and percentage of correct output sequences for the other tasks (higher is better), is indicated in parentheses. Complete plots for all layers, heads and multiple seeds are provided in Appendix \ref{app:additional_results}.
  • Figure 2: Comparison of the effects of gating and convolution on the eigenvalue spectra for one head, across selected tasks, models, and layers. Complete plots for all layers and tasks are provided in Appendix \ref{app:additional_results}.
  • Figure 3: Eigenvalue distribution comparison for single-layer Mamba-2 and softmax attention models with and without convolution on MQAR. Results for one out of four heads are shown.
  • Figure 4: Eigenvalue distributions on CIFAR-10 and ListOps for one of the heads for: (left, with white background) norm attention with convolution and different normalization functions; (right, with grey background) Mamba-2 as LTI. Complete plots for all layers and tasks are provided in Appendix \ref{app:additional_results}.
  • Figure 5: Plot legend of all subsequent figures.
  • ...and 21 more figures
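
The binning described in the Figure 1 caption (percentage of eigenvalues within discretized ranges that emphasize values near zero and near one) can be sketched as follows. This is an illustration under stated assumptions, not the authors' plotting code: the bin edges in `EDGES` are hypothetical, the function name `eigenvalue_histogram` is made up for this example, and eigenvalues are binned by magnitude, with anything at or above the last edge counted in the final bin.

```python
from bisect import bisect_right

# Hypothetical bin edges, finer near 0 and near 1; not taken from the paper.
EDGES = [0.0, 0.01, 0.1, 0.5, 0.9, 0.99, 1.0]


def eigenvalue_histogram(eigenvalues, edges=EDGES):
    """Return the percentage of eigenvalue magnitudes in each [edge_i, edge_{i+1}) bin.

    Magnitudes at or above the last edge are counted in the final bin.
    """
    if not eigenvalues:
        raise ValueError("need at least one eigenvalue")
    counts = [0] * (len(edges) - 1)
    for lam in eigenvalues:
        magnitude = abs(lam)
        # bisect_right gives the number of edges <= magnitude; clip to the last bin.
        index = min(bisect_right(edges, magnitude) - 1, len(counts) - 1)
        counts[index] += 1
    return [100.0 * count / len(eigenvalues) for count in counts]


if __name__ == "__main__":
    sample = [0.0, 0.005, 0.95, 0.995, 1.0]  # made-up values for illustration
    for low, high, pct in zip(EDGES[:-1], EDGES[1:], eigenvalue_histogram(sample)):
        print(f"[{low}, {high}): {pct:.1f}%")
```

Computing such a histogram once at initialization and once after training, then comparing the two, mirrors the light and dark bars in the caption.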