Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics

Gunn Kim

TL;DR

This work proposes a first-principles Lagrangian framework for Transformer attention by embedding information states on an information manifold endowed with the Fisher–Rao metric. It demonstrates that the Softmax attention rule emerges as the unique equilibrium that minimizes Helmholtz free energy at an effective temperature $T=\sqrt{d_k}$, while the query–key interaction is interpreted as a field–dipole coupling. The theory unifies inference and learning under a pair of thermodynamic laws for information dynamics and links emergent phenomena such as scaling laws and grokking to phase-transition-like behavior, with RoPE connected to massless Goldstone modes. While offering a coherent physical narrative, the paper emphasizes limitations and outlines concrete avenues for falsification, including cross-architecture tests, continuous-depth formulations, and operational definitions of thermodynamic variables.

Abstract

Although the Transformer architecture has revolutionized artificial intelligence, its underlying mechanisms remain largely heuristic and lack a unified physical theory. In this work, we propose a first-principles framework for information dynamics, treating the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold with the Fisher information metric, we derive the intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state that minimizes the Helmholtz free energy of the information gas. In addition, we identify the query-key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. This theory establishes the first law of information thermodynamics, unifying inference (mechanical work) and learning (chemical evolution). It also explains emergent phenomena, such as scaling laws and grokking, as phase transitions characterized by the divergence of specific heat. Finally, we discuss how rotational symmetry breaking in the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary positional embeddings (RoPE). Our work connects Statistical Physics and Deep Learning, laying the groundwork for a general theory of physics-based intelligence.
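The free-energy claim can be checked numerically under some stated assumptions. The sketch below is not the paper's derivation: it assumes the energy of key $i$ is $E_i = -q \cdot k_i$, uses Shannon entropy in nats, and takes $T=\sqrt{d_k}$ as stated in the TL;DR. All names (`free_energy`, `p_star`, `min_gap`) are illustrative. It confirms that the scaled-dot-product softmax attains $F = -T\log Z$ and that no randomly drawn distribution on the simplex achieves a lower free energy.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def free_energy(p, energies, temperature):
    # F(p) = <E>_p - T * H(p), with H the Shannon entropy in nats.
    p = np.clip(p, 1e-300, 1.0)  # avoid log(0)
    mean_energy = np.dot(p, energies)
    entropy = -np.sum(p * np.log(p))
    return mean_energy - temperature * entropy

rng = np.random.default_rng(0)
d_k, n_keys = 64, 8
q = rng.standard_normal(d_k)            # one query vector
K = rng.standard_normal((n_keys, d_k))  # key vectors, one per row

energies = -(K @ q)   # assumption: E_i = -q . k_i
T = np.sqrt(d_k)      # effective temperature from the TL;DR

# Candidate minimizer: the usual attention weights softmax(q.k / sqrt(d_k)).
p_star = softmax(-energies / T)
f_star = free_energy(p_star, energies, T)

# Closed form for the equilibrium free energy: F* = -T log Z.
f_closed = -T * np.log(np.sum(np.exp(-energies / T)))

# Compare against random points on the probability simplex.
trials = rng.dirichlet(np.ones(n_keys), size=1000)
f_trials = np.array([free_energy(p, energies, T) for p in trials])
min_gap = float(np.min(f_trials - f_star))

print(f"F(softmax) = {f_star:.6f}, -T log Z = {f_closed:.6f}")
print(f"smallest F(p) - F(softmax) over random p: {min_gap:.6f}")
```

The gap is non-negative because $F(p) - F(p^*) = T\,\mathrm{KL}(p \,\|\, p^*)$ for any distribution $p$. This is the standard Gibbs variational argument and holds for any positive temperature, so the check alone does not single out $\sqrt{d_k}$.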

Paper Structure

This paper contains 42 sections, 30 equations, 1 figure.

Figures (1)

  • Figure 1: Theoretical estimation of the value of $\kappa$. The plot shows the convergence of $\kappa$ for two different model architectures as a function of context window volume ($V_{ctx}$). The system exhibits a phase transition from an initial transient state (high variance) to a stable thermodynamic equilibrium, suggesting that $\kappa$ may be an intensive property of the intelligence system.