An Optimal Control Approach To Transformer Training

Kağan Akman; Naci Saldı; Serdar Yüksel

TL;DR

A rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints, such as realized-input-independence during execution, the ensemble control nature of the problem, and positional dependence, and that establishes stability and empirical consistency properties of the lifted model.

Abstract

In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics are not Markovian, we show that lifting them to probability measures produces a fully-observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild compactness assumptions. We further prove that closed-loop policies in the lifted problem are equivalent to initial-distribution-dependent open-loop policies, which are realized-input-independent and compatible with standard Transformer training. To train a Transformer, we propose a triply quantized training procedure for the lifted MDP that quantizes the state space, the space of probability measures, and the action space, and we show that any optimal policy for the triply quantized model is near-optimal for the original training problem. Finally, we establish stability and empirical consistency properties of the lifted model by showing that the value function is continuous with respect to perturbations of the initial empirical measure and that policies converge as the data size increases. This approach provides a globally optimal and robust alternative to gradient-based training without requiring smoothness or convexity.
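To make the particle-system view concrete, the following is a minimal sketch of one self-attention update written as a noise-free mean-field step. It is an illustration under assumptions, not the paper's model: the function name `attention_step`, the residual softmax-attention form, and the square matrices `Q`, `K`, `V` (playing the role of the shared action) are all hypothetical choices made here. Each token is a particle, every particle is driven by the same action, and the update of a particle depends on the others only through their empirical measure.

```python
import math


def attention_step(particles, Q, K, V):
    """One deterministic update of all particles under a shared action (Q, K, V).

    Assumed form (not taken from the paper): residual softmax attention,
    x_i <- x_i + sum_j softmax_j(<Q x_i, K x_j>) V x_j.
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    keys = [matvec(K, x) for x in particles]
    values = [matvec(V, x) for x in particles]
    dim = len(particles[0])

    updated = []
    for x in particles:
        query = matvec(Q, x)
        scores = [dot(query, k) for k in keys]
        shift = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - shift) for s in scores]
        total = sum(weights)
        # Weighted average of values: an integral against the empirical measure.
        drift = [
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)
        ]
        updated.append([xi + di for xi, di in zip(x, drift)])
    return updated


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for row in attention_step(tokens, identity, identity, identity):
        print(row)
```

In this sketch, duplicating the whole particle list leaves each particle's update unchanged, because the empirical measure is the same. Reordering the particles only reorders the outputs, which is one way to see why position has to be carried in the state if sequence order is to survive the lift to measures.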

Paper Structure (17 sections, 15 theorems, 119 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Contributions
The Proposed Optimal Control Formulation of Transformers
Transformer dynamics as a controlled particle system
Data set-induced dynamics and information structure
Cost functional and optimality criterion
McKean-Vlasov Structure and Measure-Valued MDP
Deterministic McKean-Vlasov representation of the particles.
The lifted system and its continuity properties.
Existence of optimal policies
Open-loop policy design via closed-loop policy design for the lifted problem.
Triply Quantized Training Scheme for Transformers
Robustness to Distributional Initialization Errors, Asymptotic Consistency and $\Gamma$-convergence to Optimality for the Generalization Problem
Numerical Experiment
...and 2 more sections

Key Result

Proposition 3

Under the compactness assumption, the function $f$ in the McKean-Vlasov dynamics is jointly continuous on $\mathbb{X}\times\mathbb{U}\times\mathcal{P}(\mathbb{X})$, where $\mathcal{P}(\mathbb{X})$ is endowed with the weak$^*$ topology.
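For orientation, noise-free McKean-Vlasov dynamics with shared actions are commonly written in the form sketched below. The notation is an assumption introduced here ($x_t^i$ for the $i$-th particle, $u_t$ for the shared action, $\mu_t^N$ for the empirical measure of the $N$ particles); the paper's own equation is not reproduced in this summary and may differ in detail.

```latex
\[
  x_{t+1}^{i} = f\bigl(x_t^{i},\, u_t,\, \mu_t^{N}\bigr),
  \qquad
  \mu_t^{N} = \frac{1}{N}\sum_{j=1}^{N} \delta_{x_t^{j}},
  \qquad i = 1,\dots,N .
\]
```

Read this way, the domain $\mathbb{X}\times\mathbb{U}\times\mathcal{P}(\mathbb{X})$ in the proposition corresponds to the three arguments of $f$: the particle state, the shared action, and the measure.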

Figures (6)

Figure 2: Four-level hierarchy of dynamics.
Figure 3: Decompositional illustration of the flow in the dynamics equation.
Figure 4: Flow of the ensemble of empirical measures on particles via the map $\bm{\Phi}$.
Figure 5: Comparison of information structures for open-loop and closed-loop controls. In the open-loop case, the actions $U_t^*$ are fixed in advance and, unlike in the closed-loop case, do not depend on the current measure.
Figure 6: Training and test errors as a function of action level.
...and 1 more figure

Theorems & Definitions (18)

Definition 1
Proposition 3
Proposition 4
Corollary 5
Theorem 6
Remark 8
Theorem 9
Proposition 10
Remark 12
Theorem 13
...and 8 more
