From Embeddings to Dyson Series: Transformer Mechanics as Non-Hermitian Operator Theory

Po-Hao Chang

Abstract

Transformer architectures are typically described in algorithmic and statistical terms, leaving their internal mechanics without a familiar structural language for researchers trained in physical theories. To bridge this gap, we develop a complementary operator-theoretic framework that recasts their mechanics in a language familiar to many-body physics. Beginning from the token as a discrete index without intrinsic geometry, we show that embedding corresponds to a basis transformation into a continuous representation space. Once such a reference basis is established, self-attention naturally assumes the role of a non-Hermitian interaction operator, and network depth implements an ordered composition of these interactions. Within this formulation, several empirical properties of deep Transformers -- including stability at large depth, representational saturation, and the effectiveness of multi-head decomposition -- find natural structural interpretations as consequences of regulated operator composition. Together, spectral geometry, channel factorization, and normalization emerge as organizing structural logic rather than isolated architectural choices. This perspective does not rely on post-hoc analogy, but follows a constructive path in which each parallel arises from the preceding structural step. By recasting Transformer mechanics in operator language, the framework lowers the conceptual barrier between deep learning and many-body physics through shared mathematical structure, making tools and intuitions from each domain more readily legible to the other.

Paper Structure (13 sections, 17 equations, 3 figures)

Figures (3)

  • Figure 1: Schematic representation of the Transformer architecture. (a) Transformer layers are depicted as discrete evolution steps along the vertical axis (red), representing ordered layer depth. (b) Detailed view of an individual $l$-th layer: the Self-attention block introduces non-local, off-diagonal coupling across the lattice sites, while the Feed-Forward Network (FFN) acts as a local, on-site operator. The forward pass through the network corresponds to the successive application of these operators, analogous to an evolution process.
  • Figure 2: Schematic view of multi-head attention as operator channel factorization. (a) A dense effective interaction $V_{\text{eff}}$ acts on the full representation space of dimension $d_{\text{model}}$, mapping input states $x_j$ to updated states $x_i$. (b) In multi-head attention, the interaction is block-diagonalized into $h$ independent channels ($h=4$ in this example). The state vectors themselves are partitioned into corresponding reduced subspaces ($x_j^{(h)}$ and $x_i^{(h)}$). Each block $V_{\text{eff}}^{(h)}$ operates exclusively within its own sub-vector.
  • Figure 3: Geometric mapping between first-order perturbation theory and contextual token mixing. (Left) An unperturbed quantum reference state $|\psi_{i}^{(0)}\rangle$ is dressed by surrounding basis states $|\psi_{j}^{(0)}\rangle$ to form the perturbed state $|\psi_{i}^{(1)}\rangle$. (Right) Analogously, an ambiguous static token embedding $x_{i}$ (e.g., "bank") is contextually resolved by mixing in representations from preceding tokens (e.g., "account", "money") to produce the updated state $x_{i}^{\mathrm{new}}$.
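
The channel factorization described in the Figure 2 caption can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: it assumes each head's block is an arbitrary random $d_{\text{head}} \times d_{\text{head}}$ matrix, and it illustrates only the block structure in feature space, leaving out the token-mixing attention weights. The helper names `blockwise_apply` and `assemble_block_diagonal`, and the sizes chosen, are hypothetical.

```python
import numpy as np


def blockwise_apply(x, blocks):
    """Per-head view: apply each block only to its own slice of the features."""
    chunks = np.split(x, len(blocks), axis=-1)  # h chunks of width d_head
    return np.concatenate([c @ B.T for c, B in zip(chunks, blocks)], axis=-1)


def assemble_block_diagonal(blocks):
    """Dense view: place the per-head blocks on the diagonal of one matrix."""
    h = len(blocks)
    d_head = blocks[0].shape[0]
    full = np.zeros((h * d_head, h * d_head))
    for k, B in enumerate(blocks):
        s = slice(k * d_head, (k + 1) * d_head)
        full[s, s] = B
    return full


rng = np.random.default_rng(0)
h, d_head, n_tokens = 4, 3, 5  # assumed toy sizes; d_model = h * d_head = 12

# Hypothetical stand-ins for the per-head blocks V_eff^(h) and token states x_j.
blocks = [rng.normal(size=(d_head, d_head)) for _ in range(h)]
x = rng.normal(size=(n_tokens, h * d_head))

V_full = assemble_block_diagonal(blocks)
dense_result = x @ V_full.T                 # one block-diagonal operator on d_model
head_result = blockwise_apply(x, blocks)    # h independent operators on d_head

print(np.allclose(dense_result, head_result))
print(V_full.size, sum(B.size for B in blocks))
```

The two results agree because a block-diagonal operator never couples features belonging to different heads. In this toy setting the dense operator has $12^2 = 144$ entries, while the four blocks together hold $4 \times 3^2 = 36$.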