An Optimal Control Approach To Transformer Training

Kağan Akman, Naci Saldı, Serdar Yüksel

TL;DR

A rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints (realized-input-independence during execution, the ensemble control nature of the problem, and positional dependence) and establishes stability and empirical consistency properties of the lifted model.

Abstract

In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics are not Markovian, we show that lifting them to probability measures produces a fully observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild compactness assumptions. We further prove that closed-loop policies in the lifted MDP are equivalent to initial-distribution-dependent open-loop policies, which are realized-input-independent and compatible with standard Transformer training. To train a Transformer, we propose a triply quantized training procedure for the lifted MDP that quantizes the state space, the space of probability measures, and the action space, and we show that any optimal policy for the triply quantized model is near-optimal for the original training problem. Finally, we establish stability and empirical consistency properties of the lifted model by showing that the value function is continuous with respect to perturbations of the initial empirical measures and that policies converge as the data size increases. This approach provides a globally optimal and robust alternative to gradient-based training without requiring smoothness or convexity.
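The particle-system view in the abstract can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's model: each token is a hypothetical `(position, scalar embedding)` particle, the shared action `u` is a single scalar gain, and `particle_update` is a toy attention-like map in which each particle's update depends on the whole ensemble (the mean-field interaction). It shows that one lifted step depends only on the empirical measure, and that carrying the position inside the state lets the sequence order be recovered afterwards.

```python
import math

def particle_update(x, u, particles):
    """Toy f(x, u, mu): update one particle; `particles` represents the empirical measure mu."""
    pos, val = x
    # Attention scores of this particle against every particle in the ensemble.
    scores = [val * v for _, v in particles]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    # fsum keeps the sums insensitive to the order in which particles are listed.
    total = math.fsum(weights)
    attended = math.fsum(w * v for w, (_, v) in zip(weights, particles)) / total
    # Residual step scaled by the shared action u; the position tag is carried unchanged.
    return (pos, val + u * attended)

def lifted_step(particles, u):
    """Push the whole empirical measure forward under one shared action u."""
    return [particle_update(x, u, particles) for x in particles]

# Hypothetical 3-token sequence of (position, scalar embedding) pairs.
tokens = [(0, 1.0), (1, -0.5), (2, 0.25)]
shuffled = [tokens[2], tokens[0], tokens[1]]

next_ordered = lifted_step(tokens, u=0.5)
# Sorting by the position tag restores the sequence order after the step.
next_shuffled = sorted(lifted_step(shuffled, u=0.5))

for a, b in zip(next_ordered, next_shuffled):
    print(a[0] == b[0], math.isclose(a[1], b[1]))
```

Listing the particles in a different order changes nothing about the updated ensemble, which is why the state seen by the controller can be taken to be the measure itself rather than an ordered tuple.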

Paper Structure

This paper contains 17 sections, 15 theorems, 119 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3

Under the compactness assumption, the function $f$ in the McKean-Vlasov dynamics is jointly continuous on $\mathbb{X}\times\mathbb{U}\times\mathcal{P}(\mathbb{X})$, where $\mathcal{P}(\mathbb{X})$ is endowed with the weak$^*$ topology.
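Spelled out along sequences, and assuming $\mathbb{X}$ and $\mathbb{U}$ are metric spaces so that continuity can be checked sequentially (an assumption not stated in this summary), joint continuity reads as follows, where the convergence of measures is the usual weak convergence tested against bounded continuous functions $g$:

$$
(x_n, u_n, \mu_n) \to (x, u, \mu) \;\Longrightarrow\; f(x_n, u_n, \mu_n) \to f(x, u, \mu),
\qquad
\mu_n \to \mu \;\text{ means }\; \int g \, d\mu_n \to \int g \, d\mu \;\text{ for all bounded continuous } g .
$$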

Figures (6)

  • Figure 2: Four-level hierarchy of dynamics.
  • Figure 3: Decompositional illustration of the flow in \eqref{eq:dynamics}.
  • Figure 4: Flow of the ensemble of empirical measures on particles via the map $\bm{\Phi}$.
  • Figure 5: Comparison of information structures for open-loop and closed-loop controls. In the open-loop case, the actions $U_t^*$ are fixed in advance and do not depend on the current measure, unlike in the closed-loop case.
  • Figure 6: Training and test errors as a function of action level.
  • ...and 1 more figure
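The open-loop versus closed-loop distinction in Figure 5 can be illustrated with a minimal sketch. The dynamics and the feedback rule below are hypothetical toy choices, not taken from the paper; the point is only that when the lifted dynamics are noise-free and the initial measure is known, the actions chosen by a closed-loop rule can be recorded once and replayed as a fixed open-loop sequence with the same result.

```python
def lifted_step(measure, u):
    """Toy deterministic lifted dynamics on a list of scalar particles."""
    mean = sum(measure) / len(measure)
    # Each particle moves toward the ensemble mean at a rate set by the shared action.
    return [x + u * (mean - x) for x in measure]

def closed_loop_policy(t, measure):
    """Hypothetical feedback rule: contract faster while the ensemble is spread out."""
    spread = max(measure) - min(measure)
    return 0.8 if spread > 1.0 else 0.2

def run_closed_loop(measure, policy, horizon):
    actions = []
    for t in range(horizon):
        u = policy(t, measure)  # action depends on the current measure
        actions.append(u)
        measure = lifted_step(measure, u)
    return measure, actions

def run_open_loop(measure, actions):
    for u in actions:  # actions were fixed in advance
        measure = lifted_step(measure, u)
    return measure

mu0 = [0.0, 1.0, 4.0]  # assumed initial ensemble
final_closed, plan = run_closed_loop(mu0, closed_loop_policy, horizon=4)
final_open = run_open_loop(mu0, plan)
print(plan)
print(final_closed == final_open)
```

The recorded `plan` depends on `mu0` only, so it can be computed before any particular input is realized; a different initial measure would in general produce a different plan.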

Theorems & Definitions (18)

  • Definition 1
  • Proposition 3
  • Proposition 4
  • Corollary 5
  • Theorem 6
  • Remark 8
  • Theorem 9
  • Proposition 10
  • Remark 12
  • Theorem 13
  • ...and 8 more