Approximation Theory for Lipschitz Continuous Transformers

Takashi Furuya, Davide Murari, Carola-Bibiane Schönlieb

TL;DR

This work introduces a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction, and proves a universal approximation theorem for this class within a Lipschitz-constrained function space.

Abstract

Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz continuous Transformer architectures.

Paper Structure

This paper contains 24 sections, 9 theorems, and 123 equations.

Key Result

Lemma 1

[sherry2024designing] Assume that $\tau \in [0, 2/\|W\|_2^2]$. Then the map $F_{\xi} : (\mathbb{R}^d, \|\cdot\|_2) \to (\mathbb{R}^d, \|\cdot\|_2)$ defined in the paper's equation (eq:1lipMLP), with $\sigma=\mathrm{ReLU}$, is 1-Lipschitz continuous.
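
To make Lemma 1 concrete, the following is a minimal numerical sketch rather than the authors' implementation. This summary does not reproduce the referenced layer equation, so the form $F_{\xi}(x) = x - \tau W^\top \mathrm{ReLU}(Wx + b)$ used here is an assumption. It was chosen because it is an explicit Euler step of the negative gradient flow of $\tfrac{1}{2}\|\mathrm{ReLU}(Wx+b)\|_2^2$ and is consistent with the step-size bound in the lemma. The helper names (`gradient_step_layer`, `worst_ratio`) and the random test setup are hypothetical. The step size is set to $2/\|W\|_F^2$, which lies in $[0, 2/\|W\|_2^2]$ because the Frobenius norm upper-bounds the spectral norm.

```python
import math
import random


def matvec(W, x):
    # Computes W x for a matrix stored as a list of rows.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def matvec_t(W, z):
    # Computes W^T z.
    return [sum(W[i][j] * z[i] for i in range(len(W))) for j in range(len(W[0]))]


def gradient_step_layer(x, W, b, tau):
    # ASSUMED layer form (not taken from the source): x - tau * W^T ReLU(W x + b).
    pre = [s + bi for s, bi in zip(matvec(W, x), b)]
    act = [max(0.0, s) for s in pre]
    grad = matvec_t(W, act)
    return [xj - tau * gj for xj, gj in zip(x, grad)]


def norm(v):
    return math.sqrt(sum(s * s for s in v))


def worst_ratio(d=4, h=6, trials=2000, seed=0):
    # Largest observed ||F(x) - F(y)|| / ||x - y|| over random input pairs.
    rng = random.Random(seed)
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(h)]
    b = [rng.gauss(0, 1) for _ in range(h)]
    frob_sq = sum(w * w for row in W for w in row)
    tau = 2.0 / frob_sq  # <= 2 / ||W||_2^2, since ||W||_2 <= ||W||_F
    worst = 0.0
    for _ in range(trials):
        x = [rng.gauss(0, 3) for _ in range(d)]
        y = [rng.gauss(0, 3) for _ in range(d)]
        fx = gradient_step_layer(x, W, b, tau)
        fy = gradient_step_layer(y, W, b, tau)
        num = norm([p - q for p, q in zip(fx, fy)])
        den = norm([p - q for p, q in zip(x, y)])
        worst = max(worst, num / den)
    return worst


if __name__ == "__main__":
    print("worst observed Lipschitz ratio:", worst_ratio())
```

Under the assumed layer form, the observed ratio should not exceed 1 up to floating-point error. A random search of this kind only probes the bound empirically and does not replace the proof.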

Theorems & Definitions (13)

  • Lemma 1
  • Lemma 2
  • Remark 3
  • Definition 4: Propagation of the measure of tokens
  • Definition 5: Lipschitz deep Transformer
  • Lemma 6
  • Remark 7
  • Theorem 8
  • Lemma 9: Variant of the Restricted Stone–Weierstrass theorem
  • Lemma 10
  • ...and 3 more