Length independent generalization bounds for deep SSM architectures via Rademacher contraction and stability constraints

Dániel Rácz, Mihály Petreczky, Bálint Daróczy

TL;DR

This work addresses the challenge of generalization in deep State-Space Model (SSM) architectures operating on long sequences by deriving a sequence-length independent PAC bound. Central to the approach is the Rademacher Contraction (RC) framework, which bounds the Rademacher complexity of deep SSMs by their stability-driven norms (notably $H_2$ and $\ell_1$ norms) and a controlled composition of RC blocks. The main result shows that, under mild assumptions and stability constraints, the generalization gap scales as $O(1/\sqrt{N})$ with a bound that does not depend on the sequence length $T$, though it may grow with depth unless contraction holds. This provides theoretical justification for using stability-enforced SSM blocks (as in S4/S5/LRU) and offers a principled, architecture-agnostic perspective on why deep SSMs generalize well on long-range data. The framework yields a practical interpretation of stability as a mechanism that controls generalization, and it sets the stage for tighter bounds and extensions to broader dynamical architectures.
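For orientation, the generic Rademacher-complexity route to a bound of this shape can be sketched as follows. This is the standard textbook form, assuming a loss class $\mathcal{G}$ with values in $[0,1]$ and i.i.d. samples $z_1, \dots, z_N$; it is not the paper's Theorem 3.2, and the constant $K$ is a placeholder introduced here for illustration.

```latex
% Rademacher complexity of the loss class G over N i.i.d. samples.
\[
\mathcal{R}_N(\mathcal{G})
  = \mathbb{E}_{z,\sigma}\left[\sup_{g \in \mathcal{G}} \frac{1}{N}\sum_{i=1}^{N} \sigma_i\, g(z_i)\right],
\qquad \sigma_i \ \text{i.i.d. uniform on } \{-1,+1\}.
\]
% With probability at least 1 - \delta, simultaneously for all g in G:
\[
\mathbb{E}[g(z)]
  \le \frac{1}{N}\sum_{i=1}^{N} g(z_i)
  + 2\,\mathcal{R}_N(\mathcal{G})
  + \sqrt{\frac{\log(1/\delta)}{2N}}.
\]
% If R_N(G) <= K / sqrt(N) with K independent of the sequence length T,
% the generalization gap is O(1/sqrt(N)) uniformly in T.
```

Read this way, the length-independence claim reduces to controlling the complexity term by a constant that does not grow with $T$, which is the role the TL;DR assigns to the stability-driven norms.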

Abstract

Many state-of-the-art models trained on long-range sequences, for example S4, S5 or LRU, are made of sequential blocks combining State-Space Models (SSMs) with neural networks. In this paper we provide a PAC bound that holds for this kind of architecture with stable SSM blocks and does not depend on the length of the input sequence. Imposing stability of the SSM blocks is standard practice in the literature, and it is known to help performance. Our results provide a theoretical justification for the use of stable SSM blocks, as the proposed PAC bound decreases as the degree of stability of the SSM blocks increases.
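To make the role of stability concrete, the sketch below computes truncated $\ell_1$ and $H_2$ norms of the impulse response of a toy stable SSM. It is an illustration only, not the paper's bound: it assumes a single-input single-output, discrete-time, real diagonal system with impulse response $h_k = \sum_i c_i \lambda_i^k b_i$, and the names `impulse_response` and `truncated_norms` as well as the numerical values are hypothetical choices made here.

```python
import math

def impulse_response(lams, b, c, horizon):
    """Impulse response h_k = sum_i c_i * lam_i**k * b_i for k = 0..horizon-1.

    Assumes a diagonal, real, SISO state-space model (illustrative only).
    """
    return [
        sum(ci * (lam ** k) * bi for lam, bi, ci in zip(lams, b, c))
        for k in range(horizon)
    ]

def truncated_norms(lams, b, c, horizon):
    """Return (l1, h2) norms of the impulse response truncated at `horizon`."""
    h = impulse_response(lams, b, c, horizon)
    l1 = sum(abs(v) for v in h)            # sum_k |h_k|
    h2 = math.sqrt(sum(v * v for v in h))  # sqrt(sum_k h_k^2)
    return l1, h2

# Hypothetical parameters: all eigenvalues lie strictly inside the unit circle.
b = [1.0, 1.0]
c = [0.5, 0.5]
stable = [0.9, 0.5]
more_stable = [0.6, 0.3]

for T in (10, 100, 1000):
    print(T, truncated_norms(stable, b, c, T), truncated_norms(more_stable, b, c, T))
```

For a stable system both truncated norms converge as the horizon grows, so they can be bounded independently of the sequence length, and moving the eigenvalues further inside the unit circle makes them smaller. This matches the qualitative statement in the abstract that the bound decreases as the degree of stability increases.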

Paper Structure

This paper contains 15 sections, 11 theorems, 46 equations, 3 figures, and 2 tables.

Key Result

Theorem 3.2

Let $\mathcal{F}$ be a set of deep SSM models with stable SSM blocks, which satisfy a number of mild regularity assumptions. There exist constants $K_l$ and $K_{\mathcal{F}}$, which depend only on the model class $\mathcal{F}$, such that for any time horizon $T > 0$, any confidence level $\delta > 0$ ...

Figures (3)

  • Figure 1: Dataset containing two classes of spiral curves.
  • Figure 2: Upper bound on the true loss, obtained by combining the empirical loss with the bounding term from Theorem \ref{thm:maingeneral}, for various values of $N$.
  • Figure 3: Behavior of the bound on the true loss during learning.

Theorems & Definitions (31)

  • Remark 3.1: Selective SSMs
  • Theorem 3.2: Informal theorem
  • Corollary 3.3
  • Lemma 4.1: [chellaboina1999induced]
  • Definition 4.2: MLP layer
  • Definition 4.3: GLU layer S5
  • Definition 4.4
  • Definition 4.5: encoder, decoder, pooling
  • Definition 4.6
  • Definition 5.1: $(\mu, c)$-Rademacher Contraction
  • ...and 21 more