The Curious Case of In-Training Compression of State Space Models

Makram Chahine, Philipp Nazari, Daniela Rus, T. Konstantin Rusch

TL;DR

This work targets the high computational cost of State Space Models (SSMs) for long-sequence tasks by introducing CompreSSM, an in-training model order reduction technique based on balanced truncation and Hankel singular values (HSVs). By tracking HSVs via controllability/observability Gramians during training and truncating low-importance state dimensions when their energy falls below a threshold, CompreSSM delivers substantial training speedups while largely preserving, or even improving, task performance. The method is demonstrated across diverse datasets with Linear Time-Invariant SSMs, including diagonal and selective variants, and is supported by Weyl’s perturbation theory to justify stable HSV ordering under training updates. The approach offers a principled, generalizable framework for efficient SSM training and points to extensions to linear time-varying systems and linear self-attention models, with code available publicly.
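As a rough illustration of the quantities mentioned above, the sketch below computes the Hankel singular values of a small discrete LTI system from its controllability and observability Gramians, then picks a reduced order from an energy tolerance. The function names, the toy diagonal system, and the exact truncation rule (discard the tail whose share of the HSV sum is at most `tol`) are assumptions made for illustration; they are not taken from the CompreSSM code base.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def hankel_singular_values(A, B, C):
    """HSVs of a stable system x[k+1] = A x[k] + B u[k], y[k] = C x[k]."""
    # Controllability Gramian P solves A P A^H - P + B B^H = 0.
    P = solve_discrete_lyapunov(A, B @ B.conj().T)
    # Observability Gramian Q solves A^H Q A - Q + C^H C = 0.
    Q = solve_discrete_lyapunov(A.conj().T, C.conj().T @ C)
    # HSVs are the square roots of the eigenvalues of P Q
    # (real and non-negative in exact arithmetic; clip round-off).
    eigs = np.linalg.eigvals(P @ Q)
    return np.sort(np.sqrt(np.clip(eigs.real, 0.0, None)))[::-1]


def rank_for_tolerance(hsv, tol):
    """Smallest order r whose discarded tail holds at most `tol` of the HSV sum.

    Assumed criterion for this sketch, not necessarily the paper's rule.
    """
    # tail[r] = sum(hsv[r:]) / sum(hsv), i.e. the share lost when keeping r states.
    tail = np.cumsum(hsv[::-1])[::-1] / hsv.sum()
    keep = np.nonzero(tail <= tol)[0]
    return int(keep[0]) if keep.size else len(hsv)


# Toy diagonal system (hypothetical values, single input and output).
rng = np.random.default_rng(0)
n = 8
A = np.diag(rng.uniform(0.1, 0.95, size=n))
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

hsv = hankel_singular_values(A, B, C)
print("HSVs:", hsv)
print("order kept at tol=0.05:", rank_for_tolerance(hsv, tol=0.05))
```

In an in-training setting, a check of this kind would presumably be run periodically on each SSM layer, with the layer shrunk whenever the selected order drops below the current state dimension.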

Abstract

State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance. Project code is available at github.com/camail-official/compressm.

Paper Structure

This paper contains 24 sections, 3 theorems, 14 equations, 8 figures, 3 tables.

Key Result

Theorem 2.6

Any stable, minimal discrete LTI system admits a balanced realization, in which the controllability and observability Gramians coincide as ${\bm{W}} = \text{diag}({\bm{\sigma}}) = \text{diag}(\sigma_1, \dots, \sigma_n)$, with $\sigma_1 \geq \cdots \geq \sigma_n > 0$ called "Hankel singular values".
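A balanced realization of this kind can be constructed numerically. The sketch below uses the standard square-root method (Cholesky factors of both Gramians followed by an SVD), which is a textbook construction and not necessarily the procedure used in the paper. The helper names `balance` and `truncate` and the three-state example system are hypothetical.

```python
import numpy as np
from scipy.linalg import cholesky, solve_discrete_lyapunov, svd


def balance(A, B, C):
    """Return a balanced realization (Ab, Bb, Cb) and the HSVs of a real system."""
    P = solve_discrete_lyapunov(A, B @ B.T)      # controllability Gramian
    Q = solve_discrete_lyapunov(A.T, C.T @ C)    # observability Gramian
    Lp = cholesky(P, lower=True)                 # P = Lp Lp^T (needs minimality)
    Lq = cholesky(Q, lower=True)                 # Q = Lq Lq^T
    U, s, Vt = svd(Lq.T @ Lp)                    # singular values s are the HSVs
    T = Lp @ Vt.T / np.sqrt(s)                   # T = Lp V S^{-1/2}
    Tinv = (U / np.sqrt(s)).T @ Lq.T             # T^{-1} = S^{-1/2} U^T Lq^T
    return Tinv @ A @ T, Tinv @ B, C @ T, s


def truncate(Ab, Bb, Cb, r):
    """Keep the r balanced states with the largest HSVs."""
    return Ab[:r, :r], Bb[:r, :], Cb[:, :r]


# Hypothetical stable, minimal system with three states.
A = np.diag([0.9, 0.5, -0.3])
B = np.ones((3, 1))
C = np.ones((1, 3))

Ab, Bb, Cb, s = balance(A, B, C)

# In balanced coordinates both Gramians should equal diag(s).
Pb = solve_discrete_lyapunov(Ab, Bb @ Bb.T)
Qb = solve_discrete_lyapunov(Ab.T, Cb.T @ Cb)
print("HSVs:", s)
print("max deviation of Gramians from diag(s):",
      max(np.abs(Pb - np.diag(s)).max(), np.abs(Qb - np.diag(s)).max()))

Ar, Br, Cr = truncate(Ab, Bb, Cb, r=2)
```

Because the balanced states are ordered by decreasing Hankel singular value, truncation reduces to slicing off the trailing rows and columns.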

Figures (8)

  • Figure 1: Overview of the proposed balanced truncation pipeline. The method applies at the level of the discrete linear dynamical systems inside SSM layers, independently of surrounding design choices such as projections, non-linearities, convolutions, or skip connections. Each dynamical system is isolated, balanced via its controllability and observability Gramians, and truncated according to Hankel singular values before being reinserted into the model.
  • Figure 2: In-training per-step analysis of Hankel singular value dynamics for a single LRU block with a state dimension of 8 on the MNIST dataset for the first 25k steps. The leftmost plot shows the raw HSVs (as a set). The middle-left plot depicts the maximum absolute eigenvalue of $\delta {\bm{H}}$ as described in Section \ref{sec:why}. The middle-right plot overlays the maximum variation bound as an error margin around each HSV, with each shade now representing a highly probable path for a specific state dimension, obtained by solving a linear sum assignment problem at each step. The rightmost plot shows the relative contribution of the bottom $r$ HSVs to the total energy.
  • Figure 3: Subfigure (\ref{fig:red-perf-cifar}) shows the performance of different models trained on CIFAR10 as a function of the state dimension. Grey data indicates non-reduced models, and the shades of orange correspond to reduced models, with tolerance decreasing with redness. The circles represent the top-3 mean, while the star corresponds to the top-1 model. Subfigure (\ref{fig:time-perf-cifar}) shows top-3 performance versus the normalized average training time. Marker diameter is proportional to the final model order (also annotated), and in-between models are omitted for visual decluttering.
  • Figure 4: Test performance vs. final state dimension for all our experiments. Stars correspond to best performance, circles to the mean of the top-3 runs. Grey shapes correspond to non-reduced models, and the shades of orange to reduced models, with tolerance decreasing with redness.
  • Figure 5: Single LRU block with state dimension of 64 on the MNIST dataset.
  • ...and 3 more figures
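
The per-dimension paths described in the Figure 2 caption can be approximated by matching HSVs between consecutive training steps. The sketch below is one plausible reading of that procedure, using `scipy.optimize.linear_sum_assignment` with absolute difference as the matching cost. The cost function, the helper names, and the toy history are assumptions, not details confirmed by the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_hsvs(prev, curr):
    """Return perm such that curr[perm[i]] continues the path of prev[i]."""
    cost = np.abs(prev[:, None] - curr[None, :])   # cost[i, j] = |prev_i - curr_j|
    _, cols = linear_sum_assignment(cost)          # rows are 0..n-1 for a square cost
    return cols


def track_paths(hsv_history):
    """Turn a list of unordered HSV sets into an array of per-state paths."""
    paths = [np.asarray(hsv_history[0], dtype=float)]
    for curr in hsv_history[1:]:
        curr = np.asarray(curr, dtype=float)
        paths.append(curr[match_hsvs(paths[-1], curr)])
    return np.stack(paths)                         # shape (steps, n)


# Toy history: the same three values drift slowly but arrive in shuffled order.
history = [[3.0, 1.0, 0.2], [0.25, 2.9, 1.1], [1.0, 0.3, 2.8]]
paths = track_paths(history)
print(paths)
```

Each column of `paths` then follows one state dimension across steps, which is the kind of trajectory the middle-right panel shades.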

Theorems & Definitions (5)

  • Definition 2.4: State Space Realization
  • Definition 2.5: Minimal/Balanced realizations
  • Theorem 2.6: approxDS.ch7
  • Theorem 2.7: Weyl1912DasAV
  • Lemma 3.1: Continuity of Hankel singular values under training updates
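
To make the role of Weyl's theorem concrete, the sketch below checks the classical inequality numerically: for symmetric matrices, each sorted eigenvalue of H + δH differs from the corresponding eigenvalue of H by at most the largest absolute eigenvalue of δH. The matrices here are random symmetric stand-ins chosen for illustration. They are not the matrices constructed in the paper, and `weyl_gap` is a hypothetical helper name.

```python
import numpy as np


def weyl_gap(H, dH):
    """Return (largest eigenvalue shift, Weyl bound) for symmetric H and dH."""
    lam_old = np.linalg.eigvalsh(H)               # ascending order
    lam_new = np.linalg.eigvalsh(H + dH)          # ascending order
    shift = np.max(np.abs(lam_new - lam_old))     # worst-case movement of any eigenvalue
    bound = np.max(np.abs(np.linalg.eigvalsh(dH)))  # spectral norm of symmetric dH
    return shift, bound


rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
H = M + M.T                                       # symmetric "before update" matrix
E = 1e-2 * rng.standard_normal((6, 6))
dH = E + E.T                                      # small symmetric perturbation

shift, bound = weyl_gap(H, dH)
print("largest eigenvalue shift:", shift)
print("Weyl bound:", bound)
```

When the bound is small compared with the gaps between neighbouring values, the error margins around them do not overlap, which is the intuition behind using a small training update to argue for continuity.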