Length independent generalization bounds for deep SSM architectures via Rademacher contraction and stability constraints

Dániel Rácz; Mihály Petreczky; Bálint Daróczy

TL;DR

This work addresses the challenge of generalization in deep State-Space Model (SSM) architectures operating on long sequences by deriving a sequence-length independent PAC bound. Central to the approach is the Rademacher Contraction (RC) framework, which bounds the Rademacher complexity of deep SSMs by their stability-driven norms (notably $H_2$ and $\ell_1$ norms) and a controlled composition of RC blocks. The main result shows that, under mild assumptions and stability constraints, the generalization gap scales as $O(1/\sqrt{N})$ with a bound that does not depend on the sequence length $T$, though it may grow with depth unless contraction holds. This provides theoretical justification for using stability-enforced SSM blocks (as in S4/S5/LRU) and offers a principled, architecture-agnostic perspective on why deep SSMs generalize well on long-range data. The framework yields a practical interpretation of stability as a mechanism that controls generalization, and it sets the stage for tighter bounds and extensions to broader dynamical architectures.

Abstract

Many state-of-the-art models trained on long-range sequences, for example S4, S5 or LRU, are made of sequential blocks combining State-Space Models (SSMs) with neural networks. In this paper we provide a PAC bound that holds for this kind of architecture with stable SSM blocks and does not depend on the length of the input sequence. Imposing stability of the SSM blocks is standard practice in the literature, and it is known to help performance. Our results provide a theoretical justification for the use of stable SSM blocks, as the proposed PAC bound decreases as the degree of stability of the SSM blocks increases.
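The role of stability can be illustrated numerically. The sketch below is not the paper's code: it assumes a discrete-time SSM with diagonal state matrix (as in LRU-style blocks) and uses the standard textbook definitions of the $H_2$ and $\ell_1$ system norms via the impulse response, which may differ in detail from the paper's definitions. The function names `impulse_response`, `h2_norm` and `l1_norm` are hypothetical. It shows that, for a stable block, both norms converge as the horizon $T$ grows and shrink as the eigenvalue moves further inside the unit circle.

```python
import numpy as np

def impulse_response(lam, B, C, D, T):
    """First T Markov parameters of x_{k+1} = diag(lam) x_k + B u_k, y_k = C x_k + D u_k.

    Returns an array of shape (T, p, m) with h_0 = D and h_k = C diag(lam)^(k-1) B.
    """
    h = [D]
    powers = np.ones_like(lam)           # diagonal of diag(lam)^0
    for _ in range(1, T):
        h.append((C * powers) @ B)       # C diag(powers) B via column scaling
        powers = powers * lam
    return np.stack(h)

def h2_norm(h):
    # Square root of the total squared magnitude of the impulse response.
    return float(np.sqrt(np.sum(np.abs(h) ** 2)))

def l1_norm(h):
    # Largest total absolute impulse response over output channels.
    return float(np.max(np.sum(np.abs(h), axis=(0, 2))))

# Scalar example (assumed for illustration): B = C = 1, D = 0, eigenvalue a.
one, zero = np.ones((1, 1)), np.zeros((1, 1))
for a in (0.9, 0.5):
    for T in (200, 2000):
        h = impulse_response(np.array([a]), one, one, zero, T)
        print(f"a={a}, T={T}: H2={h2_norm(h):.4f}, l1={l1_norm(h):.4f}")
```

For the scalar case the limits have closed forms, $1/\sqrt{1-a^2}$ for the $H_2$ norm and $1/(1-|a|)$ for the $\ell_1$ norm, so the printed values can be checked by hand; both stay bounded in $T$ only when $|a| < 1$.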

Paper Structure (15 sections, 11 theorems, 46 equations, 3 figures, 2 tables)

Introduction
Related work
Informal statement of the result
Formal problem setup
Deep SSMs
Assumptions
Main results
Numerical example
Conclusions
Related work on PAC-Bayesian bounds, on finite sample bounds and on PAC bounds for non i.i.d. data
Deep SSM architectures
Rademacher complexity
Rademacher Contractions in the literature
Proofs
Numerical example

Key Result

Theorem 3.2

Let $\mathcal{F}$ be a set of deep SSM models with stable SSM blocks that satisfy a number of mild regularity assumptions. There exist constants $K_l$ and $K_{\mathcal{F}}$, which depend only on the model class $\mathcal{F}$, such that for any time horizon $T > 0$, any confidence level $\delta > 0

Figures (3)

Figure 1: Dataset containing two classes of spiral curves.
Figure 2: Upper bound on the true loss by taking the empirical loss and the bounding term from Theorem \ref{thm:maingeneral} for various values of $N$.
Figure 3: Behavior of the bound on the true loss during learning.

Theorems & Definitions (31)

Remark 3.1: Selective SSMs
Theorem 3.2: Informal theorem
Corollary 3.3
Lemma 4.1: chellaboina1999induced
Definition 4.2: MLP layer
Definition 4.3: GLU layer S5
Definition 4.4
Definition 4.5: encoder, decoder, pooling
Definition 4.6
Definition 5.1: $(\mu, c)$-Rademacher Contraction
...and 21 more
