Autocorrelation Matters: Understanding the Role of Initialization Schemes for State Space Models

Fusheng Liu, Qianxiao Li

TL;DR

This work addresses initialization for state-space models (SSMs) by centering analysis on input sequence autocorrelation rather than purely HiPPO-based priors. It shows that the model timescale $\Delta$ should be chosen in light of the data autocorrelation spectrum, that allowing $\Re(W)=0$ can dramatically extend memory without sacrificing initialization stability, and that the imaginary parts of the state matrix $W$ govern optimization conditioning while introducing a tradeoff between approximation and estimation when dominant frequencies are closely spaced. The authors provide theoretical bounds linking $\Delta$ to sequence length $L$ via $\lambda_{\max}(\mathbb{E}[xx^\top])$, establish conditions under which zero real parts improve memory, and derive bounds on the Gram matrix spectrum to explain conditioning benefits of complex-valued SSMs, complemented by experiments on copying tasks, decorrelated sequential MNIST, and Long Range Arena. Together, these results offer a data-driven initialization framework for SSMs that improves stability, memory, and optimization efficiency in fixed-length sequence tasks, with practical implications for long-range modeling across vision, time series, and language processing. Key ideas include the dependence on data autocorrelation for $\Delta$, the stabilizing yet memory-enhancing role of $\Re(W)=0$, and the conditioning benefits and tradeoffs introduced by the imaginary parts of $W$.

Abstract

Current methods for initializing state space model (SSM) parameters primarily rely on the HiPPO framework \citep{gu2023how}, which is based on online function approximation with the SSM kernel basis. However, the HiPPO framework does not explicitly account for the effects of the temporal structures of input sequences on the optimization of SSMs. In this paper, we take a further step to investigate the roles of SSM initialization schemes by considering the autocorrelation of input sequences. Specifically, we: (1) rigorously characterize the dependency of the SSM timescale on sequence length based on sequence autocorrelation; (2) find that with a proper timescale, allowing a zero real part for the eigenvalues of the SSM state matrix mitigates the curse of memory while still maintaining stability at initialization; (3) show that the imaginary part of the eigenvalues of the SSM state matrix determines the conditioning of SSM optimization problems, and uncover an approximation-estimation tradeoff when training SSMs with a specific class of target functions.
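
To make the role of sequence autocorrelation concrete, the following minimal sketch estimates $\lambda_{\max}(\mathbb{E}[xx^\top])$ from samples for two synthetic extremes: white noise and a perfectly correlated (constant) sequence. The helper name `lambda_max_autocorr`, the sample sizes, and the two processes are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def lambda_max_autocorr(samples):
    """Largest eigenvalue of the empirical autocorrelation matrix E[x x^T].

    samples: array of shape (n, L), one length-L sequence per row.
    """
    n = samples.shape[0]
    autocorr = samples.T @ samples / n               # (L, L), symmetric PSD
    return float(np.linalg.eigvalsh(autocorr)[-1])   # eigvalsh sorts ascending

rng = np.random.default_rng(0)
L, n = 64, 20000

# White noise: E[x x^T] is the identity, so lambda_max stays near 1 for any L.
white = rng.standard_normal((n, L))

# Constant sequence x = z * (1, ..., 1) with z ~ N(0, 1):
# E[x x^T] is the all-ones matrix, so lambda_max grows linearly with L.
constant = rng.standard_normal((n, 1)) * np.ones((1, L))

lam_white = lambda_max_autocorr(white)
lam_const = lambda_max_autocorr(constant)
print(f"white noise: {lam_white:.2f}, constant: {lam_const:.2f}, L = {L}")
```

The two cases bracket how strongly $\lambda_{\max}$ can depend on $L$, which is the quantity the summary above ties to the choice of $\Delta$.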

Paper Structure

This paper contains 15 sections, 9 theorems, 37 equations, 12 figures, 3 tables.

Key Result

Theorem 1

Consider a ZOH-discretized SSM (\ref{eq: zoh ssm}) with timescale $\Delta > 0$ and $\Re(w_j) \leq 0$ for $j = 1,\ldots,m$. Suppose that the input sequence $(x_0,\ldots,x_{L-1})$ is sampled from an unknown distribution in $\mathbb{R}^L$, and that the entries of the read-out vector $c$ are i.i.d. standard normal [...] where $\lambda_{\max}(\cdot)$ represents the maximal eigenvalue.
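
As a rough illustration of the kind of model Theorem 1 refers to, the sketch below runs a diagonal SSM under the standard zero-order-hold (ZOH) discretization, $\bar{a}_j = e^{\Delta w_j}$ and $\bar{b}_j = (\bar{a}_j - 1)\, b_j / w_j$. The paper's own equation is not reproduced in this summary, so this form, the function name `zoh_diagonal_ssm`, the input vector `b`, and taking the real part of the read-out are all assumptions.

```python
import numpy as np

def zoh_diagonal_ssm(x, w, b, c, delta):
    """Final output of a ZOH-discretized diagonal SSM (assumed standard form)."""
    w = np.asarray(w, dtype=complex)
    a_bar = np.exp(delta * w)              # per-mode discrete transition
    b_bar = (a_bar - 1.0) / w * b          # ZOH input weights; requires w_j != 0
    h = np.zeros(w.shape, dtype=complex)   # hidden state starts at zero
    for x_k in x:
        h = a_bar * h + b_bar * x_k        # one recurrence step per input sample
    return float(np.real(np.sum(c * h)))   # real part of the read-out c^T h

# Hypothetical setup: m modes with Re(w_j) = -0.5 and spread imaginary parts,
# with a read-out c drawn i.i.d. from a standard normal as in the theorem.
rng = np.random.default_rng(0)
m, L = 8, 256
w = -0.5 + 1j * np.arange(1, m + 1)
b = np.ones(m)
c = rng.standard_normal(m)
x = rng.standard_normal(L)
y_last = zoh_diagonal_ssm(x, w, b, c, delta=1.0 / L)
```

Averaging `y_last**2` over many draws of `x` and `c` gives a Monte Carlo estimate of the output magnitude for a chosen dependence of $\Delta$ on $L$.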

Figures (12)

  • Figure 1: (Left) Training a diagonal SSM (\ref{eq: zoh ssm}) on a copying task using i.i.d. data with a dimension of $128$. We vary the minimal timescale $\Delta_{\min} \in \{1/L, 1/\sqrt{L}\}$ and the maximal timescale $\Delta_{\max} \in \{1/L, 1/\sqrt{L}, 0.1\}$ w.r.t. sequence length $L$. (Middle) The maximal eigenvalue of the autocorrelation matrix $\mathbb{E}[xx^\top]$ on different random processes of $x$. (Right) The maximal eigenvalue of $\mathbb{E}[xx^\top]$ on sequential image datasets sMNIST and sCIFAR10 with different resize rates varied from $0.5$ to $4$.
  • Figure 2: The expected magnitude of the SSM output value on synthetic sequences with different autocorrelation. The real part $\Re(w) = -0.5$ follows the common practice and we consider four dependencies between the timescale $\Delta$ and the sequence length $L$.
  • Figure 3: The expected magnitude of the SSM output value on synthetic sequences with different autocorrelation and different dependencies between $\Delta$ and $L$. The real part $\Re(w)$ is set to be zero.
  • Figure 4: The expected magnitude of the SSM output value on sequential image datasets with different resize rates (ranging from $0.5$ to $4$) and different dependencies between $\Delta$ and $L$.
  • Figure 5: (Left) Training a diagonal SSM (\ref{eq: zoh ssm}) on a task that requires long-term memory. The learned memory function $\tilde{\rho}$ effectively captures the spike in long-range dependencies. However, it struggles to do so when the real part is negative. (Middle) Test loss on the long-term memory task when initializing $\Re(w) = 0$ and $\Re(w) = -0.5$. (Right) Test accuracy for training a diagonal SSM on decorrelated sequential MNIST dataset with different real parts at initialization.
  • ...and 7 more figures
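
The contrast between $\Re(w) = -0.5$ and $\Re(w) = 0$ in the captions above can be seen from a single mode: the discrete transition $e^{\Delta w}$ applied over $k$ steps has magnitude $e^{\Re(w)\,\Delta\,k}$, independent of $\Im(w)$. The sketch below evaluates this at lag $k = L$ for the three timescales named in Figure 1; the helper name `kernel_magnitude` and the values $L = 1000$, $\Im(w) = 3$ are illustrative assumptions.

```python
import numpy as np

def kernel_magnitude(re_w, im_w, delta, lag):
    """Magnitude of exp(delta * w * lag) for a single mode w = re_w + i * im_w."""
    w = complex(re_w, im_w)
    return float(abs(np.exp(delta * w * lag)))

L = 1000
timescales = {"1/L": 1.0 / L, "1/sqrt(L)": 1.0 / np.sqrt(L), "0.1": 0.1}

for name, delta in timescales.items():
    decaying = kernel_magnitude(-0.5, 3.0, delta, L)   # Re(w) = -0.5
    lossless = kernel_magnitude(0.0, 3.0, delta, L)    # Re(w) = 0
    print(f"delta = {name:>9}: Re(w)=-0.5 -> {decaying:.3e}, Re(w)=0 -> {lossless:.3f}")
```

With a negative real part, the contribution of an input $L$ steps back shrinks exponentially in $\Delta L$, whereas a zero real part keeps its magnitude at one for every timescale.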

Theorems & Definitions (18)

  • Remark 1
  • Theorem 1
  • Remark 2
  • Proposition 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2: Itô’s isometry
  • Lemma 3: Gershgorin circle theorem
  • Lemma 4
  • ...and 8 more