The Curved Spacetime of Transformer Architectures
Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano
TL;DR
This work introduces a differential-geometry-inspired framework for Transformer-based language models, positing that queries and keys induce an effective metric $g_{ij}$ and that attention implements a discrete connection that transports value vectors across tokens. Treating stacked Transformer layers as discrete time steps, the authors argue that embeddings trace geodesic-like, curvature-bearing trajectories, and they interpret backpropagation as a least-action optimization over a semantic manifold. They propose two concrete curvature proxies, the local turning angle and the global length-to-chord ratio, and test them in three experiments: a paragraph-scale curvature landscape, a cross-layer curvature analysis, and a context-driven gravitational-lensing test that reveals context-dependent deflections in embedding trajectories. The results show nontrivial curvature that cannot be explained by dimensionality alone and demonstrate that contextual edits measurably bend embedding paths in a meaning-consistent manner, offering a geometric lens for understanding, and potentially guiding, Transformer representations and interpretability.
Abstract
We present a geometric framework for understanding Transformer-based language models, drawing an explicit analogy to General Relativity. Queries and keys induce an effective metric on representation space, and attention acts as a discrete connection that implements parallel transport of value vectors across tokens. Stacked layers provide discrete time-slices through which token representations evolve on this curved manifold, while backpropagation plays the role of a least-action principle that shapes loss-minimizing trajectories in parameter space. If this analogy is correct, token embeddings should not traverse straight paths in feature space; instead, their layer-wise steps should bend and reorient under interactions mediated by embedding-space curvature. To test this prediction, we design experiments that expose both the presence and the consequences of curvature: (i) we visualize a curvature landscape for a full paragraph, revealing how local turning angles vary across tokens and layers; (ii) we show through simulations that excess counts of sharp/flat angles and longer length-to-chord ratios are not explainable by dimensionality or chance; and (iii) inspired by Einstein's eclipse experiment, we probe deflection under controlled context edits, demonstrating measurable, meaning-consistent bends in embedding trajectories that confirm attention-induced curvature.
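The two curvature proxies named above can be illustrated with a minimal sketch. It assumes a token's trajectory is stored as an array with one row per layer, that the turning angle is the angle between consecutive layer-to-layer displacement vectors, and that the length-to-chord ratio is the summed step length divided by the straight-line distance between the first and last layer. These definitions, and the function names `turning_angles` and `length_to_chord`, are illustrative assumptions and may differ in detail from the paper's.

```python
import numpy as np


def turning_angles(traj):
    """Angles (radians) between consecutive layer-wise steps.

    traj: array of shape (n_layers, d), one token's representation per layer.
    Assumes no step has zero length (otherwise the angle is undefined).
    """
    steps = np.diff(traj, axis=0)            # displacement between layers
    a, b = steps[:-1], steps[1:]             # consecutive step pairs
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    # Clip to guard against round-off pushing |cos| slightly above 1.
    return np.arccos(np.clip(cos, -1.0, 1.0))


def length_to_chord(traj):
    """Total path length divided by the end-to-end (chord) distance.

    Equals 1 for a straight path and grows as the path bends.
    Assumes the first and last points differ.
    """
    steps = np.diff(traj, axis=0)
    path_length = np.linalg.norm(steps, axis=1).sum()
    chord = np.linalg.norm(traj[-1] - traj[0])
    return path_length / chord


# A straight path versus a path with a single right-angle turn.
straight = np.outer(np.arange(5), [1.0, 2.0, 3.0])
bent = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

print(turning_angles(straight), length_to_chord(straight))
print(turning_angles(bent), length_to_chord(bent))
```

A straight trajectory gives turning angles of zero and a ratio of one; any bending raises both. For a chance baseline of the kind the abstract mentions, one option (again an assumption, not necessarily the authors' procedure) is to apply the same functions to random walks with independent Gaussian steps in the same dimension, whose turning angles concentrate near $\pi/2$ as the dimension grows.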
