Provable optimal transport with transformers: The essence of depth and prompt engineering
Hadi Daneshmand
TL;DR
The paper links token alignment in transformers to discrete optimal transport (OT), providing a mechanistic interpretation in which softmax self-attention effectively performs gradient-descent steps on the OT dual objective. It proves depth controls OT approximation accuracy with a constructive bound, and shows that deep, pre-trained transformers can solve OT and even sorting tasks without retraining, aided by engineered prompts that extend memory. Empirical results in English–French translation and embedding-based OT corroborate that attention progressively aligns semantically related word pairs and that prompt design dramatically enhances in-context computation. Together, these results offer both theoretical insight into transformer dynamics and practical guidance for prompting strategies in alignment and order-related tasks.
Abstract
Despite their empirical success, the internal mechanism by which transformer models align tokens during language processing remains poorly understood. This paper provides a mechanistic and theoretical explanation of token alignment in LLMs. We first present empirical evidence that, in machine translation, attention weights progressively align translated word pairs across layers, closely approximating Optimal Transport (OT) between word embeddings. Building on this observation, we prove that softmax self-attention layers can simulate gradient descent on the dual of the entropy-regularized OT problem, providing a theoretical foundation for the alignment. Our analysis yields a constructive convergence bound showing that transformer depth controls OT approximation accuracy. A direct implication is that standard transformers can sort lists of varying lengths without any parameter adjustment, up to an error term that vanishes as transformer depth increases.
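To make the link between depth, the OT dual, and sorting concrete, the following is a minimal NumPy sketch, not the paper's attention-based construction. It runs plain gradient ascent on the dual of entropy-regularized OT (equivalently, gradient descent on its negation), treating each iteration as one "layer", and then reads a sorted list off the resulting transport plan. The function names, the regularization strength `eps`, the step size `lr`, the evenly spaced target grid `y`, and the squared-distance cost are all illustrative assumptions.

```python
import numpy as np

def entropic_ot_dual_ascent(C, a, b, eps=0.1, lr=0.03, depth=500):
    """Gradient ascent on the dual of entropy-regularized OT (illustrative).

    Dual objective: <f, a> + <g, b> - eps * sum_ij exp((f_i + g_j - C_ij) / eps).
    Its gradient is the marginal mismatch of the current plan P.
    `depth` is the number of steps, standing in for the number of layers.
    """
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    for _ in range(depth):
        P = np.exp((f[:, None] + g[None, :] - C) / eps)  # current plan
        f = f + lr * (a - P.sum(axis=1))  # row-marginal mismatch
        g = g + lr * (b - P.sum(axis=0))  # column-marginal mismatch
    return np.exp((f[:, None] + g[None, :] - C) / eps)

def marginal_error(P, a, b):
    """Largest absolute violation of the two marginal constraints."""
    return max(np.abs(P.sum(axis=1) - a).max(), np.abs(P.sum(axis=0) - b).max())

# Sorting as OT: transport the unsorted values onto an increasing grid.
x = np.array([0.9, 0.1, 0.5])
n = len(x)
y = np.linspace(0.0, 1.0, n)         # assumed target positions, increasing
C = (x[:, None] - y[None, :]) ** 2   # squared-distance cost
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)

P_shallow = entropic_ot_dual_ascent(C, a, b, depth=5)
P_deep = entropic_ot_dual_ascent(C, a, b, depth=500)

# Column j of the plan concentrates on the input that belongs at rank j.
sorted_x = x[P_deep.argmax(axis=0)]
```

Comparing `marginal_error(P_shallow, a, b)` with `marginal_error(P_deep, a, b)` gives a rough, toy-scale view of how more steps tighten the approximation. With a one-dimensional squared-distance cost the exact OT plan is the monotone matching, which is why the column-wise argmax of the plan can be read as a sort in this example.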
