Bridging Expressivity and Scalability with Adaptive Unitary SSMs

Arjun Karuvally, Franz Nowak, Anderson T. Keller, Carmen Amo Alonso, Terrence J. Sejnowski, Hava T. Siegelmann

TL;DR

This work addresses the expressivity–scalability gap in long-sequence modeling by introducing the Adaptive Unitary State Space Model (AUSSM), which uses input-dependent skew-symmetric recurrence to yield unitary dynamics and rich temporal representations. Theoretical results show AUSSM can perform modulo counting and, when combined with Mamba, achieve maximal expressivity within diagonal SSMs, effectively realizing solvable regular languages. To scale this expressive power, the authors develop a separable convolution formulation and a CUDA kernel, reducing adaptive recurrence from quadratic to linear time/space and enabling practical training on long sequences. Empirically, AUSSM and the hybrid AUSSM+Mamba model outperform prior SSMs on algorithmic tasks and deliver strong performance on real-world long-time-series benchmarks, including state-of-the-art results on Weather forecasting. The work also draws connections between adaptive unitary dynamics and conserved neural trajectories, suggesting a robust inductive bias for both symbolic and continuous sequence modeling with potential broad impact on scalable temporal reasoning.

Abstract

Recent work has revealed that state space models (SSMs), while efficient for long-sequence processing, are fundamentally limited in their ability to represent formal languages, particularly due to time-invariant and real-valued recurrence structures. In this work, we draw inspiration from adaptive and structured dynamics observed in biological neural systems and introduce the Adaptive Unitary State Space Model (AUSSM), a novel class of SSMs that leverages skew-symmetric, input-dependent recurrence to achieve unitary evolution and high expressive power. Using algebraic automata theory, we prove that AUSSM can perform modulo counting and simulate solvable group automata at finite precision, enabling AUSSM to model a broad class of regular languages out of reach for other SSM architectures. To overcome the practical inefficiencies of adaptive recurrence, we develop a separable convolution formulation and a CUDA implementation that enables scalable parallel training. Empirically, we show that AUSSM and its hybrid variant, interleaved with Mamba, outperform prior SSMs on formal algorithmic tasks such as parity and modular arithmetic, and achieve competent performance on real-world long time-series classification benchmarks. Our results demonstrate that adaptive unitary recurrence provides a powerful and efficient inductive bias for both symbolic and continuous sequence modeling. The code is available at https://github.com/arjunkaruvally/AUSSM
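
As a hedged illustration of the modulo-counting claim, the sketch below counts ones modulo m with a single complex state that only ever rotates on the unit circle. The function name `count_mod`, the rotation angle $2\pi/m$, and the phase-rounding readout are assumptions made for this example; they are not taken from the paper's proofs or from its code.

```python
import cmath

def count_mod(bits, m):
    """Count the ones in `bits` modulo m with a 1-D unitary recurrence.

    Illustrative assumption: a 1 rotates the state by 2*pi/m and a 0
    leaves it unchanged, so the state always has modulus 1.
    """
    state = 1 + 0j
    for b in bits:
        # Input-dependent rotation: the angle depends on the current symbol.
        state *= cmath.exp(2j * cmath.pi * b / m)
    # Read the count back from the phase of the state.
    angle = cmath.phase(state) % (2 * cmath.pi)
    return round(angle * m / (2 * cmath.pi)) % m
```

With m = 2 this reduces to parity: `count_mod([1, 0, 1, 1], 2)` returns 1 because the sequence contains three ones. A real-valued decaying recurrence cannot hold such a count indefinitely, which is the intuition behind using a norm-preserving rotation here.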

Paper Structure

This paper contains 43 sections, 11 theorems, 49 equations, 3 figures, 6 tables.

Key Result

Theorem 1

Let $A: \mathbb{R} \to \mathbb{R}^{n \times n}$ be a smooth function such that $A(u)$ is skew-symmetric for all $u \in \mathbb{R}$. Then for each $u \in \mathbb{R}$, all eigenvalues of $A(u)$ lie on the imaginary axis, and the eigenvalues of the discrete-time transition matrix $\Phi(u) = \exp(\Delta
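
The visible part of this statement, together with Lemma 3 listed further down (the exponential of a skew-symmetric matrix is orthogonal), can be sanity-checked numerically on the smallest case. The sketch below is only an illustration: the 2x2 generator `skew_generator`, its linear dependence on the input `u`, and the truncated Taylor series used for the matrix exponential are all assumptions chosen for the example, not the paper's parameterization.

```python
import math

def skew_generator(u, omega=1.0):
    """Hypothetical 2x2 skew-symmetric generator A(u) = omega*u*[[0,-1],[1,0]]."""
    w = omega * u
    return [[0.0, -w], [w, 0.0]]

def matmul(X, Y):
    """Plain dense matrix product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_taylor(M, terms=40):
    """Matrix exponential via a truncated Taylor series (small norms only)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term holds M**k / k!
        term = [[x / k for x in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def transition(u, delta=0.1):
    """Discrete-time transition exp(delta * A(u)) for the toy generator."""
    A = skew_generator(u)
    return expm_taylor([[delta * a for a in row] for row in A])
```

For this toy generator, `transition(2.0, delta=0.5)` agrees with a plane rotation by 1.0 radian up to floating-point error, so it preserves the norm of the state and the input `u` only changes the rotation angle.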

Figures (3)

  • Figure 1: (a) Existing practical SSM blocks like Mamba use fast parallel algorithms for computing the output, resulting in a tradeoff with expressivity. Non-diagonalizable linear RNNs are the most expressive (in formal language terms) but lack scalable computational algorithms and suffer from gradient issues. AUSSM balances the expressivity-scalability tradeoff using a fully adaptive diagonal unitary recurrence. Fast SSMs with improved expressivity can be built by combining AUSSM with Mamba blocks. (b) The AUSSM block uses the same block structure as Mamba (Gu2023MambaLS), with the S6 SSM in Mamba replaced by AUSSM. The main difference between AUSSM and S6 is the adaptive recurrence: in S6, $B$, $C$, and $\Delta$ are adaptive, whereas in AUSSM, $\Delta$ and $A$ are adaptive (see Section \ref{section:AUSSM} for details). AUSSM blocks can be used as drop-in replacements for existing SSM backbones to provide higher expressivity (see Section \ref{section:theoryExpressivity} for theoretical and Section \ref{sec:experiments} for experimental validation).
  • Figure 2: AUSSM with separable convolution achieves efficient runtime and memory scaling for fully adaptive SSMs. The runtime and peak memory usage of four implementations are compared: recurrent PyTorch AUSSM, separable PyTorch AUSSM, our optimized CUDA AUSSM kernel, and the Mamba CUDA kernel. (a) The AUSSM CUDA implementation outperforms both PyTorch baselines in speed and memory efficiency, and approaches the memory efficiency of Mamba despite AUSSM's full adaptive recurrence. Notably, the PyTorch implementation of the separable convolution has better runtime efficiency compared to the recurrent implementation, albeit at a higher memory cost. (b) The AUSSM CUDA kernel has a significantly lower memory footprint, identical to that of the partially adaptive and optimized Mamba CUDA kernel.
  • Figure 3: Space complexity of SSM formulations. The figure illustrates an example convolution kernel for an SSM provided with four inputs at different timesteps ($u_t$). The convolution is visualized as a matrix multiplication over the input sequence. A. In LTI SSMs, the convolution kernel ($K_1, K_2, K_3, K_4$) is precomputed and applied to the input at different timesteps to obtain the output. B. In general LTV SSMs with time-varying recurrence, the convolution kernel has $O(L^2)$ elements, each unique to the input and output being considered at each timestep. Using convolution in this scenario leads to quadratic space complexity (akin to transformers). C. In the separable convolution case, the quadratic matrix of the general SSM can be obtained as the outer product between $f_t$ for each timestep and the cumulative sums of a function $g_k$ independent of $t$. D. Computing the convolution kernel then requires only an additional $O(2L)$ space.
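
The separable structure described for Figure 3 can be sketched on a scalar toy recurrence. The example below assumes the update $x_t = e^{i\theta_t} x_{t-1} + u_t$ with input-dependent phases $\theta_t$, which is a simplification chosen for illustration and not the paper's full AUSSM block or its CUDA kernel. Under that assumption, $x_t = f_t \sum_{k \le t} g_k$ with $f_t = e^{i S_t}$, $g_k = e^{-i S_k} u_k$, and $S_t = \sum_{j \le t} \theta_j$. The function names are hypothetical.

```python
import cmath
from itertools import accumulate

def recurrent_scan(thetas, inputs):
    """Reference: x_t = exp(i*theta_t) * x_{t-1} + u_t, computed step by step."""
    x = 0j
    out = []
    for th, u in zip(thetas, inputs):
        x = cmath.exp(1j * th) * x + u
        out.append(x)
    return out

def separable_scan(thetas, inputs):
    """Same outputs from two prefix sums instead of an L x L kernel."""
    # S_t: cumulative phase up to and including step t.
    S = list(accumulate(thetas))
    # g_k depends only on step k, never on the output index t.
    g = [cmath.exp(-1j * s) * u for s, u in zip(S, inputs)]
    # Running sum of g_k over k <= t.
    G = list(accumulate(g))
    # f_t = exp(i*S_t) rescales the running sum back to x_t.
    return [cmath.exp(1j * s) * c for s, c in zip(S, G)]
```

Both functions return the same sequence up to floating-point error, but `separable_scan` never materializes the quadratic kernel: it keeps two length-L buffers (`S` and `G`), and each prefix sum could be replaced by a parallel scan. Because every factor here has modulus 1, undoing and reapplying the phase does not amplify or shrink the terms, which would not hold for a decaying recurrence.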

Theorems & Definitions (29)

  • Theorem 1: Input-Modulated Rotation Frequencies via Skew-Symmetric Generator
  • Lemma 1
  • proof : Proof sketch
  • Lemma 2
  • proof : Proof sketch
  • Theorem 2
  • proof : Proof sketch
  • Lemma 3: Exponential of a Skew-Symmetric Matrix is Orthogonal
  • proof
  • Lemma 4: Marginal Stability of Discrete-Time Dynamics
  • ...and 19 more