Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics
Gunn Kim
TL;DR
This work proposes a first-principles Lagrangian framework for Transformer attention by placing information states on a statistical manifold endowed with the Fisher–Rao metric. It shows that the softmax attention rule emerges as the unique equilibrium that minimizes the Helmholtz free energy at an effective temperature $T=\sqrt{d_k}$, and it interprets the query–key interaction as a field–dipole coupling. The theory unifies inference and learning under a pair of thermodynamic laws for information dynamics and links emergent phenomena such as scaling laws and grokking to phase-transition-like behavior, with RoPE connected to massless Goldstone modes. While offering a coherent physical narrative, the paper emphasizes its limitations and outlines concrete avenues for falsification, including cross-architecture tests, continuous-depth formulations, and operational definitions of thermodynamic variables.
Abstract
Although the Transformer architecture has revolutionized artificial intelligence, its underlying mechanisms remain largely heuristic and lack a unified physical theory. In this work, we propose a first-principles framework for information dynamics that treats the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold equipped with the Fisher information metric, we derive an intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state that minimizes the Helmholtz free energy of the information gas. In addition, we identify the query–key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. This theory establishes a first law of information thermodynamics that unifies inference (mechanical work) and learning (chemical evolution). It also explains emergent phenomena, such as scaling laws and grokking, as phase transitions characterized by a divergence of the specific heat. Finally, we discuss how rotational symmetry breaking in the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary positional embeddings (RoPE). Our work connects statistical physics and deep learning, laying the groundwork for a general theory of physics-based intelligence.
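The free-energy claim can be checked numerically with a minimal sketch. It rests on assumptions not spelled out in the abstract: the energy of key $j$ is taken to be $E_j = -q \cdot k_j$, the entropy is the Shannon entropy of the attention weights, and the temperature is $T=\sqrt{d_k}$. Under these assumptions the softmax weights should coincide with the Boltzmann distribution, attain $F = -T \log Z$, and give a lower free energy than any other distribution over the keys. The function names below are illustrative, not taken from the paper.

```python
import numpy as np

def softmax_attention(q, K):
    """Standard scaled dot-product attention weights for one query."""
    d_k = q.shape[-1]
    logits = K @ q / np.sqrt(d_k)
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def free_energy(p, energies, T):
    """Helmholtz free energy F = <E> - T * S with Shannon entropy S."""
    p = np.clip(p, 1e-300, 1.0)  # avoid log(0)
    return float(p @ energies + T * np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
d_k, n_keys = 16, 8
q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

T = np.sqrt(d_k)     # effective temperature
energies = -K @ q    # ASSUMPTION: E_j = -q . k_j

p_star = softmax_attention(q, K)
F_star = free_energy(p_star, energies, T)

# Closed-form equilibrium value: F* = -T log Z, with Z = sum_j exp(-E_j / T).
F_closed = -T * np.log(np.sum(np.exp(-energies / T)))

# Compare against random distributions on the probability simplex.
trials = rng.dirichlet(np.ones(n_keys), size=1000)
F_trials = np.array([free_energy(p, energies, T) for p in trials])

print("F at softmax:", F_star)
print("-T log Z    :", F_closed)
print("min F over random distributions:", F_trials.min())
```

The gap between any trial distribution $p$ and the softmax solution equals $T \cdot \mathrm{KL}(p \,\|\, p^{*})$, which is non-negative and vanishes only at $p = p^{*}$. That is the sense in which the equilibrium is unique.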
