Tuning Frequency Bias of State Space Models

Annan Yu; Dongwei Lyu; Soon Hoe Lim; Michael W. Mahoney; N. Benjamin Erichson

TL;DR

The paper shows that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. It proposes two mechanisms to tune the frequency bias: scaling the initialization to tune the innate bias, or applying a Sobolev-norm-based filter that adjusts the sensitivity of the gradients to high-frequency inputs, which allows the bias to be changed via training.

Abstract

State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. This behavior aligns with the broader notion of frequency bias in deep learning model training. We show that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. Based on our theory, we propose two mechanisms to tune frequency bias: either by scaling the initialization to tune the inborn frequency bias; or by applying a Sobolev-norm-based filter to adjust the sensitivity of the gradients to high-frequency inputs, which allows us to change the frequency bias via training. Using an image-denoising task, we empirically show that we can strengthen, weaken, or even reverse the frequency bias using both mechanisms. By tuning the frequency bias, we can also improve SSMs' performance on learning long-range sequences, averaging an 88.26% accuracy on the Long-Range Arena (LRA) benchmark tasks.
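As a rough illustration of the claim that an LTI system's frequency response varies most where its poles sit, the sketch below evaluates the transfer function of a diagonal system and measures its variation in a low and a high frequency band. The pole layout (real part -0.5, imaginary parts at multiples of $\pi$), the unit residues, the band limits, and the scaling factor `alpha` applied to the imaginary parts are all assumptions chosen for illustration, not the paper's exact setup.

```python
import numpy as np

def transfer_function(s, poles, residues):
    """G(is) = sum_j c_j / (i*s - a_j) for a diagonal LTI system."""
    s = np.asarray(s, dtype=float)
    return (residues[None, :] / (1j * s[:, None] - poles[None, :])).sum(axis=1)

def band_variation(s, G, lo, hi):
    """Discrete total variation of G over the band lo <= s < hi."""
    mask = (s >= lo) & (s < hi)
    return float(np.abs(np.diff(G[mask])).sum())

def scaled_poles(n, alpha):
    """Hypothetical initialization: a_j = -0.5 + i*alpha*pi*j."""
    j = np.arange(n)
    return -0.5 + 1j * alpha * np.pi * j

s = np.linspace(0.0, 200.0, 4001)   # nonnegative frequency grid
residues = np.ones(16)

G_base = transfer_function(s, scaled_poles(16, alpha=1.0), residues)
G_scaled = transfer_function(s, scaled_poles(16, alpha=4.0), residues)

tv_lo_base = band_variation(s, G_base, 0.0, 50.0)
tv_hi_base = band_variation(s, G_base, 100.0, 200.0)
tv_hi_scaled = band_variation(s, G_scaled, 100.0, 200.0)

print(f"alpha=1: low-band TV {tv_lo_base:.2f}, high-band TV {tv_hi_base:.2f}")
print(f"alpha=4: high-band TV {tv_hi_scaled:.2f}")
```

With `alpha=1.0` all poles lie below frequency 50, so almost all of the variation falls in the low band. With `alpha=4.0` half of the poles move into the 100 to 200 band, and the response there becomes far more varied.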

Paper Structure (19 sections, 5 theorems, 46 equations, 10 figures, 4 tables)

Introduction
What is the Frequency Bias of an SSM?
Frequency Bias of an SSM at Initialization
Frequency Bias of an SSM during Training
Tuning the Frequency Bias of an SSM
Tuning Frequency Bias by Scaling the Initialization
Tuning Frequency Bias by a Sobolev Filter
Experiments and Discussions
Conclusion
Proofs
Functional Derivatives
Scaling Laws of the Initialization
More Numerical Experiments on the Illustrative Example
Details of the Experiments
Denoising Sequential Autoencoder
...and 4 more sections

Key Result

Lemma 1

Let $\tilde{\mathbf{G}}$ be the transfer function defined in eq.realG. Given any $B > \max_j |y_j|$, we have

Figures (10)

Figure 1: In a synthetic example to illustrate the frequency bias of SSMs, we form the inputs by superposing three waves of low, moderate, and high frequencies, respectively. We train an S4D model to regress the magnitudes of the three waves. We observe that the magnitudes of the low-frequency waves are approximated much better than those of the high-frequency waves. In \ref{fig:tunewaves}, we show how to tune the frequency bias in this example.
Figure 2: The frequency bias of an SSM says that the frequency response has more variation in the low-frequency area than the high-frequency one.
Figure 3: We train an LTI system to learn a noisy bimodal target transfer function. The convergence to a local minimum depends on the initial location of the pole. Left: the ground truth contains a large mode and a small mode, plus some small noise. We want to investigate which mode, if any, our trainable LTI system converges to. Middle: we train the LTI system with respect to the $L^2$-loss. We show the trajectories of $(y(\tau), \xi(\tau))$ given different initializations $(y(0),\xi(0) = 3)$. The two local minima corresponding to the two modes of $\tilde{\mathbf{F}}$ are shown as red crosses. The green trajectories (initialized in Region I) converge to the mode at $y = -50$, the magenta trajectories (initialized in Region III) converge to the mode at $y = 50$, and the black ones (initialized in Region II) converge to neither. Right: the experiment is repeated with the $H^2$-loss (see \ref{sec:tunesobolev}).
Figure 4: The outputs of image-denoising S4D models trained with different configurations.
Figure 5: Two ablation studies of the tuning strategies proposed in this paper. We train an S4D model with varying parameters of $\alpha$ and $\beta$, respectively. On the left, we see that holding $\beta = 0$ (the default value), the model achieves its best performance when $\alpha = 4$; on the right, when we fix $\alpha = 1$ (the default value), the model performs the best when $\beta = -0.5$.
...and 5 more figures
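To make the role of a Sobolev-type weight concrete, the sketch below assumes a filter of the form $(1+|s|)^{\beta}$ applied in the frequency domain before taking a squared error. This is a standard Sobolev-type weight and is only an assumption about the form of the paper's filter; the Gaussian error bumps and their locations are likewise hypothetical. It shows how the sign of `beta` changes the relative cost of equally sized errors at low and high frequencies.

```python
import numpy as np

def sobolev_weight(s, beta):
    """Assumed Sobolev-type weight (1 + |s|)^beta; beta = 0 means no weighting."""
    return (1.0 + np.abs(s)) ** beta

def weighted_sq_error(s, err, beta):
    """Riemann-sum approximation of the integral of |(1+|s|)^beta * err(s)|^2."""
    ds = s[1] - s[0]
    return float(np.sum(np.abs(sobolev_weight(s, beta) * err) ** 2) * ds)

s = np.linspace(0.0, 200.0, 4001)

def bump(center, width=2.0):
    """Hypothetical localized error in the frequency response."""
    return np.exp(-0.5 * ((s - center) / width) ** 2)

err_low = bump(10.0)     # error concentrated at a low frequency
err_high = bump(150.0)   # same-shaped error at a high frequency

for beta in (-1.0, 0.0, 1.0):
    lo = weighted_sq_error(s, err_low, beta)
    hi = weighted_sq_error(s, err_high, beta)
    print(f"beta={beta:+.1f}: low-frequency cost {lo:.4g}, high-frequency cost {hi:.4g}")
```

Under this assumed weight, a positive `beta` makes the loss (and hence its gradients) more sensitive to high-frequency errors, a negative `beta` makes it less sensitive, and `beta = 0` recovers the plain $L^2$ error.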

Theorems & Definitions (10)

Lemma 1
Corollary 1
Theorem 1
Proposition 1
Theorem 2
Proof of \ref{lem.totalvariation}
Proof of \ref{cor.HiPPObias}
Proof of \ref{thm.trainingdynamics}
Proof of \ref{thm.trainingdynamicsSob}
Proof of \ref{prop.scalinglaw}
