Can a Transformer Represent a Kalman Filter?

Gautam Goel, Peter Bartlett

TL;DR

This work proves that a Transformer can perform Kalman Filtering in linear dynamical systems. By showing that softmax self-attention implements Gaussian kernel smoothing and that this estimator strongly approximates the Kalman Filter, the authors construct an explicit Transformer Filter achieving $\varepsilon$-level accuracy uniformly in time, with a controllable temperature parameter $\beta$. They further demonstrate that this Transformer-based filter can be embedded in measurement-feedback control to closely approximate the LQG (and $H_\infty$) controllers, yielding near-optimal costs and a form of weak stabilization in closed loop. The results provide a formal bridge between deep sequence models and classical state-space estimation and control, with precise dimension bounds and stability considerations, and extend naturally to robust control settings.

Abstract

Transformers are a class of autoregressive deep learning architectures which have recently achieved state-of-the-art performance in various vision, language, and robotics tasks. We revisit the problem of Kalman Filtering in linear dynamical systems and show that Transformers can approximate the Kalman Filter in a strong sense. Specifically, for any observable LTI system we construct an explicit causally-masked Transformer which implements the Kalman Filter, up to a small additive error which is bounded uniformly in time; we call our construction the Transformer Filter. Our construction is based on a two-step reduction. We first show that a softmax self-attention block can exactly represent a Nadaraya-Watson kernel smoothing estimator with a Gaussian kernel. We then show that this estimator closely approximates the Kalman Filter. We also investigate how the Transformer Filter can be used for measurement-feedback control and prove that the resulting nonlinear controllers closely approximate the performance of standard optimal control policies such as the LQG controller.
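For orientation, the estimator being approximated can be sketched in the scalar case. The block below is a minimal illustration of the standard Kalman Filter recursion, assuming a system $x_{t+1} = a x_t + w_t$, $y_t = c x_t + v_t$ with noise variances $q$ and $r$; the function name, the parameter names, and the initial values are hypothetical and are not taken from the paper.

```python
def kalman_filter_scalar(ys, a, c, q, r, x0=0.0, p0=1.0):
    """Filtered estimates of x_t given y_0..y_t for a scalar LTI system.

    Assumed model (illustrative, not the paper's notation):
        x_{t+1} = a * x_t + w_t,  Var(w_t) = q
        y_t     = c * x_t + v_t,  Var(v_t) = r
    """
    x_pred, p_pred = x0, p0          # prior mean and variance of x_0
    estimates = []
    for y in ys:
        # Measurement update: blend the prediction with the new observation.
        gain = p_pred * c / (c * c * p_pred + r)
        x_filt = x_pred + gain * (y - c * x_pred)
        p_filt = (1.0 - gain * c) * p_pred
        estimates.append(x_filt)
        # Time update: propagate the estimate through the dynamics.
        x_pred = a * x_filt
        p_pred = a * a * p_filt + q
    return estimates


# With noise-free measurements (r = 0) and c = 1, the filter tracks y exactly.
print(kalman_filter_scalar([1.0, 2.0], a=1.0, c=1.0, q=1.0, r=0.0))
```

Each estimate is a linear function of the past observations, which is the object the Transformer Filter is constructed to match up to a small additive error.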

Paper Structure

This paper contains 8 sections, 3 theorems, and 44 equations.

Key Result

Theorem 1

Fix $\Sigma \in \mathbb{R}^{d \times d}$ and $W \in \mathbb{R}^{k \times d}$. Suppose we are given $z_0, \ldots, z_N \in \mathbb{R}^{d}$ and $z \in \mathbb{R}^{d}$. Let $F$ denote the Nadaraya–Watson estimator with a Gaussian kernel. The function $F$ can be represented by a softmax self-attention block of size $O(d^2H)$. In particular, there exists a nonlinear embedding map $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^{\ell}$ …
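The correspondence between softmax attention and Gaussian kernel smoothing can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the paper's construction: it uses an isotropic kernel $\exp(-\beta \|z - z_i\|^2)$ (the matrix $\Sigma$ is omitted), a single query, no causal mask, and a hypothetical embedding `phi(z) = (z, ||z||^2, 1)` chosen so that the query-key inner product equals $-\beta \|z - z_i\|^2$. All function names are illustrative.

```python
import math


def phi(z):
    """Hypothetical nonlinear embedding: append ||z||^2 and a constant 1."""
    return list(z) + [sum(a * a for a in z), 1.0]


def query_from(e, beta):
    """Linear map of phi(z): scale z by 2*beta, swap and scale the last two slots."""
    d = len(e) - 2
    return [2.0 * beta * a for a in e[:d]] + [-beta * e[-1], -beta * e[-2]]


def matvec(W, z):
    """Plain matrix-vector product, W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, z)) for row in W]


def softmax_attention(z, zs, W, beta):
    """Single-query softmax attention with keys phi(z_i) and values W z_i."""
    q = query_from(phi(z), beta)
    # score_i = q . phi(z_i) = 2*beta*z.z_i - beta*||z_i||^2 - beta*||z||^2
    #         = -beta * ||z - z_i||^2
    scores = [sum(a * b for a, b in zip(q, phi(zi))) for zi in zs]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    vals = [matvec(W, zi) for zi in zs]
    return [sum(e * v[j] for e, v in zip(exps, vals)) / total
            for j in range(len(W))]


def nadaraya_watson(z, zs, W, beta):
    """Gaussian-kernel Nadaraya-Watson estimate of W z_i at the query z."""
    ws = [math.exp(-beta * sum((a - b) ** 2 for a, b in zip(z, zi)))
          for zi in zs]
    total = sum(ws)
    vals = [matvec(W, zi) for zi in zs]
    return [sum(w * v[j] for w, v in zip(ws, vals)) / total
            for j in range(len(W))]


zs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
W = [[1.0, 0.0], [1.0, 1.0]]
print(softmax_attention([0.5, 0.5], zs, W, beta=0.7))
print(nadaraya_watson([0.5, 0.5], zs, W, beta=0.7))
```

The two printed vectors agree up to floating-point error because the softmax normalization cancels the common factor $\exp(-\beta \|z\|^2)$. Handling a general quadratic form in the kernel would presumably require quadratic features of $z$ in the embedding, which is consistent with the $d^2$ factor in the stated block size.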

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof