Provable Length Generalization in Sequence Prediction via Spectral Filtering

Annie Marsden, Evan Dogariu, Naman Agarwal, Xinyi Chen, Daniel Suo, Elad Hazan

TL;DR

We define a new metric of performance for length generalization in sequence prediction, the Asymmetric-Regret, which measures regret against a benchmark predictor with longer context length than available to the learner, and we present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems.

Abstract

We consider the problem of length generalization in sequence prediction. We define a new metric of performance in this setting -- the Asymmetric-Regret -- which measures regret against a benchmark predictor with longer context length than available to the learner. We continue by studying this concept through the lens of the spectral filtering algorithm. We present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems. We conclude with proof-of-concept experiments which are consistent with our theory.
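
To make the role of context length concrete, the following is a minimal sketch and not the paper's spectral filtering algorithm. It assumes the standard noiseless state-space form $x_{t+1} = A x_t + B u_t$, $y_t = C x_t + D u_t$ with $x_0 = 0$ (the paper's own LDS equations may differ in indexing), and it compares the true outputs with a predictor that only sees the last `context` inputs. The function names `simulate_lds` and `truncated_response`, the matrices, and the context lengths are illustrative assumptions.

```python
import numpy as np

def simulate_lds(A, B, C, D, u):
    """Roll out x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t from x_0 = 0."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_now in u:
        ys.append(C @ x + D @ u_now)
        x = A @ x + B @ u_now
    return np.array(ys)

def truncated_response(A, B, C, D, u, context):
    """Predict y_t from only the last `context` past inputs.

    Uses the unrolled form y_t = D u_t + sum_{i>=1} C A^{i-1} B u_{t-i}
    and drops every term with i > context.
    """
    kernels = []          # kernels[i-1] = C A^{i-1} B
    M = B.copy()
    for _ in range(context):
        kernels.append(C @ M)
        M = A @ M
    ys = []
    for t in range(len(u)):
        y = D @ u[t]
        for i in range(1, min(context, t) + 1):
            y = y + kernels[i - 1] @ u[t - i]
        ys.append(y)
    return np.array(ys)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stable system with eigenvalues close to 1 (long memory).
    A = np.diag([0.5, 0.7, 0.9, 0.95])
    B = np.ones((4, 1))
    C = np.ones((1, 4))
    D = np.zeros((1, 1))
    T = 200
    u = rng.standard_normal((T, 1))
    y = simulate_lds(A, B, C, D, u)
    for context in (5, 20, T):
        y_hat = truncated_response(A, B, C, D, u, context)
        mse = float(np.mean((y - y_hat) ** 2))
        print(f"context={context:3d}  mean squared error={mse:.3e}")
```

With the full context the unrolled predictor reproduces the simulated outputs up to floating-point error, while shorter contexts leave out the tail of the impulse response. That tail is the gap a length-generalizing learner has to close when it is compared against a benchmark with longer context.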

Paper Structure

This paper contains 25 sections, 20 theorems, 173 equations, 8 figures, 4 algorithms.

Key Result

Theorem 1

Let $T \in \mathbb{Z}_{\ge 0}$ and $q \in [0, 1]$. Consider a sequence $(y_1, \dots, y_T)$ generated by an unknown and noiseless linear dynamical system defined by matrices $(A,B,C,D)$ as per Eq. \eqref{eqn:lds_equations}. Assume the input sequence $u_{0:(t-1)}$ is sufficiently well-conditioned, satisfying ...

Figures (8)

  • Figure 1: Regions of $[0, 1]$ not covered by Theorem \ref{thm:lengthgeneralization}, with $T$ on the x-axis. For convenience, the right image zooms in to $[0.999, 1]$.
  • Figure 2: The red region (Region B) represents the interval of eigenvalues for which length generalization is not guaranteed by our main theorem. The blue region (Region A) is chosen to hug Region B on both sides: to be precise, the leftmost point of Region A is $0.9 \cdot \left(1 - \log(T)/(8T^{7/8})\right)$, and the rightmost point is $1$. This selection ensures that (1) Region A will start to contain bad eigenvalues as $q$ decreases from $7/8$ and (2) eigenvalues in Region B are bad for $q \le 7/8$.
  • Figure 3: Prediction losses $\ell_t(M^t, T^q)$ as a function of $t$ on an LDS with eigenvalues sampled from Region A, averaged over random seeds and smoothed.
  • Figure 4: Prediction losses $\ell_t(M^t, T^q)$ as a function of $t$ on an LDS with eigenvalues sampled from Region B, averaged over random seeds and smoothed.
  • Figure 5: Prediction losses $\ell_t(M^t, T^q)$ as a function of $t$ with two autoregressive components on an LDS with eigenvalues sampled from Region B, averaged over random seeds and smoothed. Contrast with Figure \ref{fig:lds_bad}.
  • ...and 3 more figures

Theorems & Definitions (36)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 4: Asymmetric-Regret
  • Theorem 5: Simplified from hazan2017efficient
  • Theorem 6
  • Theorem 7
  • Definition 8: Tensorized Spectral Filters
  • Theorem 9
  • Theorem 10
  • ...and 26 more