In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation
Frank Cole, Yuxuan Zhao, Yulong Lu, Tianhao Zhang
TL;DR
This work analyzes in-context learning by transformers for noisy linear dynamical systems under non-IID data. It proves that deep linear transformers with depth $L=O(\log T)$ can closely track the least-squares predictor, achieving a uniform $L^2$ testing loss that decays as $O(\log(T)/T)$; the construction unrolls a Richardson iterative solver. Conversely, a lower bound shows that single-layer linear transformers incur a non-vanishing test loss, revealing a depth-separation phenomenon and highlighting differences between IID and non-IID data environments. Numerical experiments corroborate the theory, illustrating the benefits of depth for in-context learning in dynamical settings and the qualitative gap between the IID and non-IID settings. Overall, the paper provides rigorous approximation guarantees for deep architectures and fundamental limits for shallow ones, offering guidance for designing transformers to learn correlated sequential dynamics.
Abstract
This paper investigates approximation-theoretic aspects of the in-context learning capability of transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable to those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
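To make the link between the least-squares estimator and the Richardson iteration named in the TL;DR concrete, the following is a minimal NumPy sketch of the classical iteration only, not of the transformer construction in the paper. It assumes a system of the form $x_{t+1} = W x_t + \sigma \xi_t$ with Gaussian noise; the matrix `W_true`, the noise level, the step size rule, and the constant 4 in the step count are all illustrative choices that do not come from the paper.

```python
import numpy as np


def simulate(W, T, sigma, rng):
    """Roll out x_{t+1} = W x_t + sigma * noise for T steps (assumed model)."""
    d = W.shape[0]
    x = np.zeros((T + 1, d))
    x[0] = rng.standard_normal(d)
    for t in range(T):
        x[t + 1] = W @ x[t] + sigma * rng.standard_normal(d)
    return x


def least_squares(x):
    """Closed-form least-squares estimate of W from one trajectory."""
    past, future = x[:-1], x[1:]
    # Solve past @ W.T ~= future in the least-squares sense.
    W_T, *_ = np.linalg.lstsq(past, future, rcond=None)
    return W_T.T


def richardson(x, num_steps):
    """Approximate the least-squares estimate with Richardson iterations."""
    past, future = x[:-1], x[1:]
    T = past.shape[0]
    X = past.T @ past / T      # empirical covariance, sum_t x_t x_t^T / T
    Y = future.T @ past / T    # cross term, sum_t x_{t+1} x_t^T / T
    alpha = 1.0 / np.linalg.eigvalsh(X).max()  # step size inside (0, 2 / lambda_max)
    W_k = np.zeros_like(Y)
    for _ in range(num_steps):
        # One Richardson step on the normal equations W X = Y.
        W_k = W_k + alpha * (Y - W_k @ X)
    return W_k


rng = np.random.default_rng(0)
W_true = np.array([[0.5, 0.1], [-0.1, 0.5]])  # hypothetical stable system
T = 500
x = simulate(W_true, T, sigma=1.0, rng=rng)
num_steps = int(np.ceil(4 * np.log(T)))  # illustrative constant 4, not from the paper
gap = np.linalg.norm(richardson(x, num_steps) - least_squares(x))
print(f"steps={num_steps}, gap to least squares={gap:.2e}")
```

With this step size the error after $k$ steps is $(W_0 - W^\ast)(I - \alpha X)^k$, which contracts geometrically when the empirical covariance $X$ is well conditioned. This is the classical reason a number of iterations logarithmic in $T$ can suffice to match the least-squares solution up to an error of order $1/T$; how the paper maps each iteration onto a transformer layer is not shown here.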
