The Curious Case of In-Training Compression of State Space Models
Makram Chahine, Philipp Nazari, Daniela Rus, T. Konstantin Rusch
TL;DR
This work targets the high computational cost of State Space Models (SSMs) on long-sequence tasks by introducing CompreSSM, an in-training model order reduction technique based on balanced truncation and Hankel singular values (HSVs). CompreSSM tracks the HSVs during training via the controllability and observability Gramians, and truncates state dimensions whose energy falls below a threshold. This delivers substantial training speedups while largely preserving, or even improving, task performance. The method is demonstrated across diverse datasets with Linear Time-Invariant SSMs, including diagonal variants, and can be extended to selective models. Weyl’s perturbation theory is used to argue that the ordering of the HSVs remains stable under training updates. The approach offers a principled, generalizable framework for efficient SSM training and points to extensions to linear time-varying systems and linear self-attention models. Code is publicly available.
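As an illustration of the perturbation argument mentioned above, Weyl's inequality states that for Hermitian matrices M and E, each sorted eigenvalue of M + E differs from the corresponding eigenvalue of M by at most the spectral norm of E. The sketch below checks this numerically on a random symmetric matrix. The matrix here is only a stand-in chosen for illustration, and the helper name `weyl_shift_and_bound` is hypothetical; this is not the paper's analysis of HSVs under gradient updates.

```python
import numpy as np

def weyl_shift_and_bound(M, E):
    """Return (largest eigenvalue shift, spectral norm of E) for symmetric M, E."""
    ev_before = np.linalg.eigvalsh(M)        # eigenvalues in ascending order
    ev_after = np.linalg.eigvalsh(M + E)     # same ordering after perturbation
    shift = np.max(np.abs(ev_after - ev_before))
    bound = np.linalg.norm(E, 2)             # spectral norm (largest singular value)
    return shift, bound

# Illustrative data (assumption): a random positive semidefinite matrix and a
# small symmetric perturbation playing the role of a training update.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
M = X @ X.T
D = rng.standard_normal((6, 6))
E = 1e-2 * (D + D.T)

shift, bound = weyl_shift_and_bound(M, E)
print(f"max eigenvalue shift = {shift:.3e}, spectral norm of E = {bound:.3e}")
```

When the gap between two neighboring eigenvalues is larger than twice the perturbation norm, the bound implies that their order cannot swap, which is the intuition behind a stable importance ranking.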
Abstract
State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for measuring the energy of each state, as well as for the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance. Project code is available at github.com/camail-official/compressm.
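To make the Hankel singular value analysis concrete, here is a minimal sketch for a discrete-time diagonal LTI system x[k+1] = diag(lam) x[k] + B u[k], y[k] = C x[k] with |lam| < 1. For diagonal dynamics the two Gramians have a closed form, and the HSVs are the square roots of the eigenvalues of their product. The function names, the toy system, and the selection rule (keep the smallest number of states whose HSVs sum to a given fraction of the total) are assumptions made for illustration and are not taken from the CompreSSM code.

```python
import numpy as np

def hankel_singular_values(lam, B, C):
    """HSVs of x[k+1] = diag(lam) x[k] + B u[k], y[k] = C x[k], assuming |lam| < 1."""
    lam = np.asarray(lam, dtype=complex)
    B = np.asarray(B, dtype=complex)
    C = np.asarray(C, dtype=complex)
    # Closed-form solutions of the discrete Lyapunov equations for diagonal A:
    #   P = A P A^H + B B^H   (controllability Gramian)
    #   Q = A^H Q A + C^H C   (observability Gramian)
    P = (B @ B.conj().T) / (1.0 - np.outer(lam, lam.conj()))
    Q = (C.conj().T @ C) / (1.0 - np.outer(lam.conj(), lam))
    # HSVs are the square roots of the eigenvalues of P Q (real and nonnegative
    # in exact arithmetic; clip small negative values from round-off).
    eigs = np.linalg.eigvals(P @ Q).real
    return np.sort(np.sqrt(np.clip(eigs, 0.0, None)))[::-1]

def rank_for_energy(hsv, keep=0.99):
    """Smallest r such that the r largest HSVs hold at least `keep` of their sum.

    Hypothetical selection rule used only for this illustration.
    """
    frac = np.cumsum(hsv) / np.sum(hsv)
    r = int(np.searchsorted(frac, keep)) + 1
    return min(r, len(hsv))

# Toy two-state system (assumption): the second state is barely driven by the input.
lam = [0.9, 0.5]
B = [[1.0], [1e-6]]
C = [[1.0, 1.0]]
hsv = hankel_singular_values(lam, B, C)
print("HSVs:", hsv)
print("states kept at 99% energy:", rank_for_energy(hsv, keep=0.99))
```

In this toy system the weakly excited state receives a negligible HSV, so the rule keeps a single state. A full balanced truncation would additionally transform the system into balanced coordinates before dropping states, which this sketch omits.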
