SpectraLDS: Provable Distillation for Linear Dynamical Systems

Devan Shah, Shlomo Fortgang, Sofiia Druchyna, Elad Hazan

TL;DR

SpectraLDS is a provable distillation of STU spectral filters into an explicit symmetric LDS, enabling constant-time per-token inference while preserving long-range predictive power. The method bridges convex spectral learning and recurrent LDS representations: it converts STU filters into LDS parameters with a provable approximation error that decays exponentially in the number of filters. In practice, this allows language modeling with per-token costs that do not grow with sequence length, and experiments show near-identical performance to STU baselines with substantial speedups. The work thus combines the learning advantages of spectral filtering with the inference efficiency of a recurrent model.

Abstract

We present the first provable method for identifying symmetric linear dynamical systems (LDS) with accuracy guarantees that are independent of the systems' state dimension or effective memory. Our approach builds upon recent work that represents symmetric LDSs as convolutions learnable via fixed spectral transformations. We show how to invert this representation, thereby recovering an LDS model from its spectral transform and yielding an end-to-end convex optimization procedure. This distillation preserves predictive accuracy while enabling constant-time and constant-space inference per token, independent of sequence length. We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling.
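The equivalence the abstract relies on, that a symmetric LDS can be run either as a convolution over the whole input or as a recurrence with a fixed-size state, can be checked on a toy example. The sketch below is a minimal illustration and not the paper's code: it assumes the symmetric transition matrix has already been diagonalized, so the system is described by real eigenvalues `alphas` and per-coordinate input and output weights `b` and `c`. All names and values are hypothetical.

```python
import math


def lds_recurrent(alphas, b, c, inputs):
    """Run a diagonal LDS step by step: x_t = alpha * x_{t-1} + b * u_t, y_t = c . x_t.

    Each step touches only the fixed-size state, so the cost per token
    does not depend on how many tokens came before.
    """
    state = [0.0] * len(alphas)
    outputs = []
    for u in inputs:
        state = [a * x + bi * u for a, x, bi in zip(alphas, state, b)]
        outputs.append(sum(ci * x for ci, x in zip(c, state)))
    return outputs


def lds_convolution(alphas, b, c, inputs):
    """Compute the same outputs as a causal convolution with the impulse response.

    kernel[j] = sum_i c_i * b_i * alpha_i**j, so producing y_t needs the
    whole input history u_0..u_t.
    """
    length = len(inputs)
    kernel = [
        sum(ci * bi * a ** j for a, bi, ci in zip(alphas, b, c))
        for j in range(length)
    ]
    return [
        sum(kernel[j] * inputs[t - j] for j in range(t + 1))
        for t in range(length)
    ]


# Hypothetical toy system: three real eigenvalues, including a negative one.
alphas = [0.9, -0.5, 0.99]
b = [1.0, 0.5, -0.3]
c = [0.2, 1.0, 0.7]
inputs = [math.sin(0.3 * t) for t in range(50)]

recurrent = lds_recurrent(alphas, b, c, inputs)
convolved = lds_convolution(alphas, b, c, inputs)
max_gap = max(abs(r - v) for r, v in zip(recurrent, convolved))
```

The two functions agree up to floating-point error, but only the recurrent form keeps the work per token independent of the sequence length.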

Paper Structure

This paper contains 41 sections, 1 theorem, 32 equations, 12 figures, 7 tables, 3 algorithms.

Key Result

Theorem 1

As long as $h \ge k$, Algorithm \ref{alg:spectralds} returns w.h.p. a matrix $\widetilde{M}$ such that [...], where $\lambda_{\max}$ is the largest eigenvalue of the Moore-Penrose pseudoinverse of the matrix $M$.
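As a rough illustration of the kind of linear fit in which a pseudoinverse appears, the sketch below expresses $k$ target filters as linear combinations of the impulse responses of $h \ge k$ one-dimensional LDSs, using ordinary least squares. It is not the paper's algorithm: the eigenvalue grid, the synthetic targets (built to lie exactly in the span of the basis), and every function name are assumptions made for the example.

```python
import numpy as np


def impulse_responses(alphas, length):
    """Row i is (1, alpha_i, alpha_i**2, ...), the impulse response of a 1-D LDS."""
    powers = np.arange(length)
    return np.asarray(alphas, dtype=float)[:, None] ** powers[None, :]


def fit_filters(targets, alphas):
    """Least-squares weights W minimizing ||W @ basis - targets||_F.

    targets has shape (k, L); the returned weights have shape (k, h).
    """
    basis = impulse_responses(alphas, targets.shape[1])  # shape (h, L)
    solution, *_ = np.linalg.lstsq(basis.T, targets.T, rcond=None)
    return solution.T, basis


# Hypothetical setup: h = 6 one-dimensional systems, k = 3 target filters.
alphas = np.linspace(-0.9, 0.9, 6)
rng = np.random.default_rng(0)
true_mix = rng.normal(size=(3, 6))
targets = true_mix @ impulse_responses(alphas, 64)

weights, basis = fit_filters(targets, alphas)
max_error = float(np.max(np.abs(weights @ basis - targets)))

# Sensitivity of the fit is governed by the pseudoinverse of the basis:
# a large spectral norm means small perturbations of the targets can
# move the fitted weights a lot.
pinv_norm = float(np.linalg.norm(np.linalg.pinv(basis), ord=2))
```

Because the synthetic targets lie in the span of the basis, the fit is exact up to floating-point error. Real spectral filters would only be approximated, and the size of the pseudoinverse indicates how well conditioned that approximation is.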

Figures (12)

  • Figure 1: Comparison of SpectraLDS and other methods learning an arbitrary symmetric LDS with and without noise. The shaded region shows the $95\%$ confidence interval over $8$ runs. Each model used its default configuration except the LDS, which required a lower learning rate to converge. More details are available in Appendix \ref{app:learning_lds}.
  • Figure 2: Fit of spectral filters by an LDS of state dimension 80, where the x-axis represents the time domain. Filters are normalized to be comparable. The blue shading in the middle panel represents a rapidly alternating filter (negative eigenvalue). A complete comparison for $k = 24$ without normalization is provided in Appendix \ref{app:fit_with_lds_80}.
  • Figure 3: Runtime for generating sequences of increasing length across STU implementations. The naive convolution approach exhibits quadratic growth, the FutureFill variants show logarithmic growth, and the distilled STU-to-LDS layers achieve linear growth. The STU-Only Epoched FutureFill runs out of memory (OOM) at the largest sequence length. As shown in the rightmost panel, the SpectraLDS models have nearly identical runtime despite varied state dimension. More results are available in Appendix \ref{app:layerspeed}.
  • Figure 4: Largest Singular Value as we increase $h$.
  • Figure 5: $\lambda_{\max} \cdot h$ as we increase $h$.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Proof of Theorem \ref{thm:Theorem1}