HOPE for a Robust Parameterization of Long-memory State Space Models

Annan Yu; Michael W. Mahoney; N. Benjamin Erichson

TL;DR

HOPE is a new parameterization scheme for LTI systems that uses Markov parameters within Hankel operators. It is implemented efficiently by nonuniformly sampling the transfer functions of LTI systems, and it requires fewer parameters than canonical SSMs.

Abstract

State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. To achieve state-of-the-art performance, an SSM often needs a specifically designed initialization, and the training of state matrices is on a logarithmic scale with a very small learning rate. To understand these choices from a unified perspective, we view SSMs through the lens of Hankel operator theory. Building upon it, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. Our approach helps improve the initialization and training stability, leading to a more robust parameterization. We efficiently implement these innovations by nonuniformly sampling the transfer functions of LTI systems, and they require fewer parameters compared to canonical SSMs. When benchmarked against HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel operators demonstrates improved performance on Long-Range Arena (LRA) tasks. Moreover, our new parameterization endows the SSM with non-decaying memory within a fixed time window, which is empirically corroborated by a sequential CIFAR-10 task with padded noise.
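As a rough illustration of the objects named in the abstract, the sketch below builds a finite Hankel matrix from the Markov parameters of a diagonal discrete LTI system and computes its singular values. The function names, the indexing convention $h_k = \sum_j c_j a_j^k b_j$, and the truncation to a finite matrix are assumptions made for illustration; this is not the HOPE implementation itself.

```python
import numpy as np

def markov_parameters(a, b, c, n):
    """Return h_0..h_{n-1} with h_k = sum_j c_j * a_j**k * b_j.

    Assumes a diagonal discrete LTI system with poles `a` inside the unit disk.
    """
    k = np.arange(n)
    powers = a[None, :] ** k[:, None]      # shape (n, number of states)
    return powers @ (b * c)

def hankel_matrix(h):
    """Finite Hankel matrix H[i, j] = h[i + j] built from the Markov parameters."""
    m = (len(h) + 1) // 2
    idx = np.arange(m)[:, None] + np.arange(m)[None, :]
    return h[idx]

def hankel_singular_values(h):
    """Singular values of the truncated Hankel matrix, in descending order."""
    return np.linalg.svd(hankel_matrix(h), compute_uv=False)

# Hypothetical two-state example: the Hankel matrix then has rank two.
a = np.array([0.5, -0.3])
h = markov_parameters(a, np.ones(2), np.ones(2), 15)
sv = hankel_singular_values(h)
print(sv / sv[0])                          # relative Hankel singular values
```

In this picture, parameterizing the system by its Markov parameters means treating the entries of `h` as the trainable quantities, rather than `a`, `b`, and `c`.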

Paper Structure (24 sections, 7 theorems, 76 equations, 10 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries
Unravel a Mystery: Hankel Singular Values in Initialization and Training
Many LTI Systems Have Low Ranks
LTI Systems are Numerically Unstable under Perturbations
HOPE-SSM: A Rankful, Stable, and Long-memory Parameterization
Experiments and Discussions
Conclusion
More background on the LTI system and Hankel operators
More background on Hankel singular values
Hankel singular values from balanced realization
Hankel singular values as a rational approximation problem
Three Different Initialization Schemes
Proof of \ref{thm.lowrank}
Proof of \ref{thm.perturbS4D}
...and 9 more sections

Key Result

Theorem 1

Given any $\epsilon > 0$, $0 < \alpha \leq 1$, and $0 < \delta \leq 1$, with probability at least $1 - \delta$, the $\epsilon$-rank of $\overline{\Gamma} = (\overline{\mathbf{A}},\overline{\mathbf{B}},\overline{\mathbf{C}},\overline{\mathbf{D}})$ with $a_j \sim F_a$ i.i.d. and $b_jc_j \sim$ [...] and the constant in $\mathcal{O}$ is universal.
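To make the notion of $\epsilon$-rank concrete, here is a minimal sketch that counts the Hankel singular values of a randomly drawn diagonal discrete system that exceed $\epsilon \sigma_1$. Two things are assumptions: the definition of $\epsilon$-rank as the number of $\sigma_j$ with $\sigma_j > \epsilon \sigma_1$, and the sampling distributions, which are arbitrary stand-ins because the theorem's distributions are not fully stated above. The function names are hypothetical.

```python
import numpy as np

def epsilon_rank(singular_values, eps):
    """Count singular values with sigma_j > eps * sigma_1 (assumed definition)."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > eps * s[0]))

def random_diagonal_hankel_sv(n_states, m, rng):
    """Singular values of the m-by-m Hankel matrix of a random diagonal system."""
    # Stand-in distributions, NOT the ones in the theorem:
    # poles a_j in a disk of radius 0.95, products b_j c_j complex Gaussian.
    radius = rng.uniform(0.0, 0.95, n_states)
    phase = rng.uniform(0.0, 2 * np.pi, n_states)
    a = radius * np.exp(1j * phase)
    bc = rng.standard_normal(n_states) + 1j * rng.standard_normal(n_states)
    k = np.arange(2 * m - 1)
    h = (a[None, :] ** k[:, None]) @ bc    # Markov parameters h_k = sum_j b_j c_j a_j^k
    idx = np.arange(m)[:, None] + np.arange(m)[None, :]
    return np.linalg.svd(h[idx], compute_uv=False)

rng = np.random.default_rng(0)
sv = random_diagonal_hankel_sv(n_states=64, m=128, rng=rng)
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, epsilon_rank(sv, eps))
```

The printed counts can be compared with the number of states (64 here) to see how far the numerical rank falls below the nominal state dimension for a given draw.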

Figures (10)

Figure 1: There are many equivalent ways to represent an LTI system. While most of the canonical SSMs use continuous LTI systems as their parameters, we propose to parameterize an SSM by the Markov parameters in its Hankel operator. The feedthrough matrix $\mathbf{D}$ is not shown in the diagram, but it is also a parameter of the LTI layers in both the canonical SSMs and our HOPE-SSM.
Figure 2: Test accuracy of the SSMs on the sCIFAR task. The LTI systems are initialized in three different ways and are either trained or untrained. When the LTI systems are initialized with $\texttt{init}_1$ (red), training the LTI system together with the other model parameters impairs the model accuracy. In contrast, for SSMs initialized with $\texttt{init}_3$ (blue), assigning the LTI system a small positive learning rate improves performance.
Figure 3: The distribution of all relative Hankel singular values $\sigma_j(\mathbf{H}) / \sigma_1(\mathbf{H})$ of the LTI systems in an SSM. For each initialization, the distribution is shown both at initialization and after the SSM is trained for $10$ epochs. Note that the second row only applies when the LTI systems are not frozen.
Figure 4: A random perturbation to the imaginary part of $\mathbf{A}$ is added to a system from $\texttt{init}_1$ and a HiPPO-LegS system from $\texttt{init}_3$. The magnitude of the perturbation is set to $0.1\%$ and $1\%$ of the original matrix $\mathbf{A}$. For each system, on the left, we show the relative Hankel singular values $\sigma_j/\sigma_1$ of the original and perturbed systems; on the right, we plot the location of each $a_j$ in the complex plane and use the color to indicate the magnitude of its associated $|b_jc_j|$.
Figure 5: The test accuracy of the HOPE-SSM on the sCIFAR-10 task and the evolution of the Hankel singular values of the model. The plots are to be compared with Figure 2 and Figure 3.
...and 5 more figures

Theorems & Definitions (14)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
...and 4 more
