Can Transformers Learn Optimal Filtering for Unknown Systems?

Haldun Balim; Zhe Du; Samet Oymak; Necmiye Ozay

TL;DR

This work investigates using transformer-based models to predict outputs of unknown dynamical systems by meta-learning over a collection of source systems drawn from a common distribution. The proposed meta-output-predictor (MOP) trains on past outputs to predict the next output, enabling rapid adaptation to unseen dynamics and even matching Kalman-filter optimality for linear systems, while showing promise in non-ideal noise and nonlinear settings such as planar quadrotors. A key theoretical contribution is a generalization bound: the excess risk decays as $\mathcal{O}(1/\sqrt{MT})$ under stability and robustness assumptions, with explicit dependence on the covering number of the transformer class and noise bounds. The work also highlights limitations in slow-mixing systems and under distribution shifts, motivating future research into robustness and safe deployment, including extensions to closed-loop control scenarios.
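As a rough reading of the stated rate (a sketch, not the paper's exact bound): assume the excess risk is at most $C/\sqrt{MT}$, where $C$ is a hypothetical constant, treated as fixed, that absorbs the covering-number and noise terms, and where $M$ and $T$ are taken here to be the number of source systems and the trajectory length. Reaching a target excess risk $\varepsilon$ then requires

$$
\frac{C}{\sqrt{MT}} \le \varepsilon
\quad\Longleftrightarrow\quad
MT \ge \frac{C^2}{\varepsilon^2},
$$

so under this simplified reading, halving the target excess risk calls for roughly four times as much training data.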

Abstract

Transformer models have shown great success in natural language processing; however, their potential remains mostly unexplored for dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. Particularly, we train the transformer using various distinct systems and then evaluate the performance on unseen systems with unknown dynamics. Empirically, the trained transformer adapts exceedingly well to different unseen systems and even matches the optimal performance given by the Kalman filter for linear systems. In more complex settings with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics like a quadrotor system with unknown parameters, transformers also demonstrate promising results. To support our experimental findings, we provide statistical guarantees that quantify the amount of training data required for the transformer to achieve a desired excess risk. Finally, we point out some limitations by identifying two classes of problems that lead to degraded performance, highlighting the need for caution when using transformers for control and estimation.
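For readers unfamiliar with the Kalman-filter baseline mentioned above, the following is a minimal sketch of one-step-ahead output prediction for a scalar linear system with known dynamics. The system values (`a`, `c`, `q`, `r`), the function names, and the zero-predictor comparison are illustrative assumptions, not the paper's experimental setup; the learned predictor in the paper would see only the past outputs, whereas this baseline is given the true parameters.

```python
import random


def simulate(a, c, q, r, steps, rng):
    """Simulate x_{t+1} = a*x_t + w_t, y_t = c*x_t + v_t.

    w_t ~ N(0, q) and v_t ~ N(0, r) are i.i.d. Gaussian; x_0 ~ N(0, 1).
    All parameter values are hypothetical, chosen only for illustration.
    """
    x = rng.gauss(0.0, 1.0)
    ys = []
    for _ in range(steps):
        ys.append(c * x + rng.gauss(0.0, r ** 0.5))
        x = a * x + rng.gauss(0.0, q ** 0.5)
    return ys


def kalman_predict_outputs(ys, a, c, q, r, x0=0.0, p0=1.0):
    """Return predictions of each y_t computed from y_0, ..., y_{t-1} only."""
    x_pred, p_pred = x0, p0  # prior mean and variance of the state
    preds = []
    for y in ys:
        # Predict the output before the current measurement is used.
        preds.append(c * x_pred)
        # Measurement update: fold the observed output into the estimate.
        s = c * p_pred * c + r          # innovation variance
        k = p_pred * c / s              # Kalman gain
        x_filt = x_pred + k * (y - c * x_pred)
        p_filt = (1.0 - k * c) * p_pred
        # Time update: propagate the estimate through the known dynamics.
        x_pred = a * x_filt
        p_pred = a * p_filt * a + q
    return preds


def mse(ys, preds):
    """Mean squared prediction error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


rng = random.Random(0)
a, c, q, r = 0.9, 1.0, 0.1, 0.1
ys = simulate(a, c, q, r, steps=2000, rng=rng)
preds = kalman_predict_outputs(ys, a, c, q, r)
kf_mse = mse(ys, preds)
zero_mse = mse(ys, [0.0] * len(ys))  # trivial baseline: always predict 0
print(f"Kalman MSE: {kf_mse:.3f}, zero-predictor MSE: {zero_mse:.3f}")
```

The Kalman predictor should show a clearly lower error than the trivial zero predictor on this system, which gives a sense of the benchmark a learned predictor is measured against.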

Paper Structure (12 sections, 2 theorems, 23 equations, 5 figures)

Introduction
Problem Setup
Experiments
Linear Systems
Planar Quadrotor Systems
Theoretical Guarantees
Preliminaries
Performance Guarantees
Proof of the Main Theorem
Systems that are hard to learn in-context
Conclusion
Acknowledgments

Key Result

Theorem 1

Suppose the stability and transformer-robustness assumptions hold, the loss function $\ell(\mathbf{y}, \cdot)$ is $L_\ell$-Lipschitz, and $\ell(\cdot, \cdot) \leq B$ for some $B \geq 0$. Then, when $MT \geq 3 \max(\sqrt{n}, \sqrt{m})$, for all $\epsilon>0$, with probability at least $1-\delta$, where $\bar{B} := 2B + 7 K L_\ell \bigl( L_g L_\rho \sigma_\mathbf{w} + \sigma_\mathbf{v} \bigr)$ …

Figures (5)

Figure 1: Training a transformer for dynamical system prediction
Figure 2: Output predictions for linear systems: (a) with i.i.d. Gaussian noise; (b) with colored (non-i.i.d.) noise; (c) with dynamics changes at $t = T/2$. MOP performs as well as or better than Kalman filter even though it does not have access to the system dynamics.
Figure 3: Output predictions for planar quadrotor systems.
Figure 4: Comparison between dense and upper-triangular $\mathbf{A}$ matrices: (a) prediction error ratio between MOP and Kalman filter; (b) matrix powers averaged over all source systems.
Figure 5: Performance of MOP compared to the Kalman filter when the test-time noise level differs from the training noise level.

Theorems & Definitions (6)

Definition 1: Covering Number
Definition 2: Distance Metric
Theorem 1
Lemma 1
Proof
Proof of Theorem 1
