In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation

Frank Cole, Yuxuan Zhao, Yulong Lu, Tianhao Zhang

TL;DR

This work analyzes in-context learning by transformers for noisy linear dynamical systems under non-IID data. It proves that deep linear transformers with depth $L=O(\log T)$ can closely track the least-squares predictor with a uniform $L^2$ testing loss decaying as $O(\log(T)/T)$, by unrolling a Richardson iterative solver. Conversely, a lower bound shows that single-layer linear transformers incur a non-vanishing test loss, revealing a depth-separation phenomenon and highlighting differences between IID and non-IID data environments. Numerical experiments corroborate the theory, illustrating the benefits of depth for in-context learning in dynamical settings and the qualitative gap between IID and non-IID analyses. Overall, the paper provides rigorous approximation guarantees for deep architectures and fundamental limits for shallow ones, offering guidance for designing transformers to learn correlated sequential dynamics.

Abstract

This paper investigates approximation-theoretic aspects of the in-context learning capability of transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.

Paper Structure

This paper contains 23 sections, 16 theorems, 147 equations, 1 figure.

Key Result

Theorem 1

There exists a transformer of depth $L = O(\log(T))$ with parameters $\theta$ such that the uniform $L^2$ testing loss is $O(\log(T)/T)$ when $T$ is sufficiently large. The implicit constants depend on $\sigma$, $w_{\max}$, and $d$, and the dependence is at most polynomial in $d$.
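
The TL;DR attributes the logarithmic depth to unrolling a Richardson iterative solver for the least-squares problem. The following is a minimal sketch of that idea in the scalar case $d = 1$, not the paper's construction: the function names, the initial state $x_0 = 0$, the step size `alpha`, and the choice of $\lceil \log_2 T \rceil$ iterations are all assumptions made for illustration. Each Richardson step plays the role of one layer, and the iterate contracts geometrically toward the least-squares estimate.

```python
import math
import random


def simulate(w, sigma, T, seed=0):
    """Simulate x_{t+1} = w * x_t + sigma * xi_t, xi_t ~ N(0, 1), x_0 = 0 (assumed)."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(T):
        xs.append(w * xs[-1] + sigma * rng.gauss(0.0, 1.0))
    return xs


def normal_equation(xs):
    """Return (a, b) such that the least-squares estimate of w solves a * w = b."""
    T = len(xs) - 1
    a = sum(x * x for x in xs[:-1]) / T                    # empirical second moment
    b = sum(x * y for x, y in zip(xs[:-1], xs[1:])) / T    # empirical cross moment
    return a, b


def richardson(a, b, alpha, num_steps):
    """Richardson iteration w_{k+1} = w_k + alpha * (b - a * w_k), starting at 0."""
    w = 0.0
    iterates = [w]
    for _ in range(num_steps):
        w = w + alpha * (b - a * w)
        iterates.append(w)
    return iterates


T = 500
xs = simulate(w=0.5, sigma=1.0, T=T)
a, b = normal_equation(xs)
w_ls = b / a  # closed-form least-squares estimate in d = 1

# Illustrative step size (assumption): gives a contraction factor of 1/2 per step.
# For d > 1, `a` is a matrix and alpha would come from a bound on its spectrum.
alpha = 1.0 / (2.0 * a)

# Number of unrolled steps grows like log T, echoing L = O(log T) in spirit only.
num_steps = math.ceil(math.log2(T))
iterates = richardson(a, b, alpha, num_steps)

for k, w_k in enumerate(iterates):
    print(f"step {k}: |w_k - w_ls| = {abs(w_k - w_ls):.3e}")
```

With this step size the error halves at every step, so about $\log_2 T$ steps bring the iterate within a factor of order $1/T$ of the least-squares estimate, which is consistent with the logarithmic depth in the theorem.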

Figures (1)

  • Figure 1: Test error as a function of training epochs, with $d = 1$, $T = 500$, $[w_{\min},w_{\max}] = [0,0.8]$, and various values of $L$

Theorems & Definitions (28)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Proof of Lemma \ref{richardsonunrolling}
  • Lemma 3
  • Lemma 4: matni2019tutorial, Theorem 4.2
  • Proof of Theorem \ref{approxerrordeeptf}
  • Lemma 5
  • ...and 18 more