SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

Viktor Stein; Wuchen Li; Gabriele Steidl

Abstract

Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.
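To make the idea of tokens carrying both a feature and a velocity variable concrete, the following is a minimal sketch of one momentum-style attention step. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`attention`, `momentum_attention_step`), the constant `damping`, the step size `step`, and the specific update order (damp the velocity, kick it with the attention output, then move the tokens) are all hypothetical choices made here. The sketch uses exactly one attention evaluation per step, in the spirit of the abstract's claim about oracle calls.

```python
import numpy as np

def softmax_rows(scores):
    # Numerically stable row-wise softmax.
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def attention(x, w_q, w_k, w_v):
    # Standard single-head softmax self-attention on tokens x of shape (n, d).
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(x.shape[1])
    return softmax_rows(scores) @ (x @ w_v)

def momentum_attention_step(x, v, w_q, w_k, w_v, step=0.1, damping=1.0):
    # Assumed update (not taken from the paper):
    #   1. damp the velocity and add the attention output as a force,
    #   2. move the tokens with the updated velocity.
    v_new = np.exp(-damping * step) * v + step * attention(x, w_q, w_k, w_v)
    x_new = x + step * v_new
    return x_new, v_new

# Toy usage with random weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
n_tokens, dim = 4, 3
x = rng.normal(size=(n_tokens, dim))
v = np.zeros_like(x)  # tokens start at rest
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
x1, v1 = momentum_attention_step(x, v, w_q, w_k, w_v)
```

With zero initial velocity, the first step reduces to a plain residual attention update scaled by `step**2`; the velocity only starts to matter from the second step onward.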

Paper Structure

This paper contains 23 sections, 6 theorems, 81 equations, 2 figures, 4 tables, 1 algorithm.

Introduction
Nesterov's Acceleration Method
Time Discretizations of Linearly Damped Hamiltonian Systems
Explicit Euler method
Conformally symplectic Euler method
Exponential Euler method
Adams-Bashforth (AB-2) method
Transformers
Accelerated Transformers with Linear Attention
Accelerated Flow
Preservation of Elliptically Contoured Distributions
Accelerated Linear Attention Dynamics
Accelerated Transformers with Softmax attention
Accelerated Flow
Accelerated Softmax Attention Dynamics
...and 8 more sections
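The outline lists several time discretizations of linearly damped Hamiltonian systems. As a rough illustration of how two of them differ, the sketch below applies an explicit Euler step and a conformally symplectic Euler step to the damped system $\dot{x} = p$, $\dot{p} = -\alpha p - \nabla f(x)$ with the toy potential $f(x) = x^2/2$. The formulas are the standard textbook forms of these schemes with a constant damping `alpha`; the paper's exact variants, damping schedule, and potential may differ, and all function names here are hypothetical.

```python
import math

def grad_f(x):
    # Gradient of the toy quadratic potential f(x) = 0.5 * x**2.
    return x

def explicit_euler_step(x, p, h, alpha):
    # Both updates use the old state (x, p).
    x_new = x + h * p
    p_new = p - h * (alpha * p + grad_f(x))
    return x_new, p_new

def conformal_symplectic_euler_step(x, p, h, alpha):
    # Damping is applied exactly via exp(-alpha * h); the position
    # update then uses the new momentum.
    p_new = math.exp(-alpha * h) * p - h * grad_f(x)
    x_new = x + h * p_new
    return x_new, p_new

def run(step_fn, n_steps=50, h=0.1, alpha=1.0, x0=1.0, p0=0.0):
    # Iterate a one-step scheme from the initial state (x0, p0).
    x, p = x0, p0
    for _ in range(n_steps):
        x, p = step_fn(x, p, h, alpha)
    return x, p

x_ee, p_ee = run(explicit_euler_step)
x_cs, p_cs = run(conformal_symplectic_euler_step)
```

For this linear example, the conformally symplectic step contracts phase-space area by exactly the factor $e^{-\alpha h}$ per step, matching the continuous flow over a time interval of length $h$; the explicit Euler step does not have this property.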

Key Result

Proposition 4.1

The accelerated linear self-attention transformer flow in sys_1 satisfies the accelerated linear attention PDE [equation omitted]. Here, $(\alpha_t)_{t > 0}$ are non-negative damping parameters.

Figures (2)

Figure 1: The accelerated attention block followed by an accelerated MLP block, as described in \ref{algo:SympFormer}.
Figure 2: Validation loss (circles) and training loss (lines) on the TinyStories dataset after 10000 optimization steps, illustrating the results from \ref{tab:tinystories_softmax_1}.

Theorems & Definitions (10)

Proposition 4.1: Accelerated linear attention PDE
Example 4.2: Elliptically contoured distributions
Proposition 4.3: Preservation of elliptically contoured distributions
Corollary 4.4: Preservation of centered elliptically contoured distributions
Proposition 4.5
Proposition 5.1: Accelerated softmax attention PDE
Proposition 5.2
Remark 5.3
Example 5.4
Example 5.5
