The Curved Spacetime of Transformer Architectures

Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano

TL;DR

This work introduces a differential-geometry-inspired framework for Transformer-based language models, positing that queries and keys induce an effective metric $g_{ij}$ and that attention implements a discrete connection that transports value vectors across tokens. Treating stacked Transformer layers as discrete time steps, the authors argue that embeddings trace geodesic-like, curvature-bearing trajectories, and they interpret backpropagation as a least-action optimization over a semantic manifold. They propose two concrete curvature proxies, the local turning angle and the global length-to-chord ratio, and validate them with three experiments: a paragraph-scale curvature landscape, a cross-layer curvature analysis, and a context-driven gravitational-lensing test that reveals context-dependent deflections in embedding trajectories. The results show nontrivial curvature that cannot be explained by dimensionality alone and demonstrate that contextual edits measurably bend embedding paths in a meaning-consistent manner, offering a geometric lens for understanding, and potentially guiding, Transformer representations and interpretability.
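The two curvature proxies named above can be illustrated with a small sketch. The exact definitions used in the paper are not given in this summary, so the code below assumes the most common reading: the turning angle is the angle between two consecutive layer-to-layer displacement vectors of one token, and the length-to-chord ratio is the total path length divided by the straight-line distance between the first and last layer. The function names `turning_angles` and `length_to_chord` are hypothetical, chosen for this illustration only.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def turning_angles(trajectory):
    """Angles in degrees between consecutive steps of a trajectory.

    `trajectory` is a list of vectors, one per layer, for a single token.
    ASSUMPTION: "turning angle" means the angle between step k and step k+1;
    the paper may define it differently.
    """
    steps = [_sub(b, a) for a, b in zip(trajectory, trajectory[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0.0 or nv == 0.0:
            # A zero-length step has no direction, so the angle is undefined.
            angles.append(float("nan"))
            continue
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
        angles.append(math.degrees(math.acos(cos)))
    return angles


def length_to_chord(trajectory):
    """Total path length divided by the end-to-end (chord) distance.

    ASSUMPTION: a ratio of 1 means a perfectly straight path, and larger
    values mean a more winding path.
    """
    path = sum(_norm(_sub(b, a)) for a, b in zip(trajectory, trajectory[1:]))
    chord = _norm(_sub(trajectory[-1], trajectory[0]))
    return path / chord if chord > 0.0 else float("inf")


# Toy 2D trajectories standing in for layer-wise embeddings.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
right_angle = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

print(turning_angles(straight), length_to_chord(straight))
print(turning_angles(right_angle), length_to_chord(right_angle))
```

Under these assumed definitions, a straight trajectory has turning angle 0 and ratio 1, while any bend raises both quantities. In practice the per-layer vectors would come from a model's hidden states rather than toy coordinates.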

Abstract

We present a geometric framework for understanding Transformer-based language models, drawing an explicit analogy to General Relativity. Queries and keys induce an effective metric on representation space, and attention acts as a discrete connection that implements parallel transport of value vectors across tokens. Stacked layers provide discrete time-slices through which token representations evolve on this curved manifold, while backpropagation plays the role of a least-action principle that shapes loss-minimizing trajectories in parameter space. If this analogy is correct, token embeddings should not traverse straight paths in feature space; instead, their layer-wise steps should bend and reorient as interactions mediated by embedding space curvature. To test this prediction, we design experiments that expose both the presence and the consequences of curvature: (i) we visualize a curvature landscape for a full paragraph, revealing how local turning angles vary across tokens and layers; (ii) we show through simulations that excess counts of sharp/flat angles and longer length-to-chord ratios are not explainable by dimensionality or chance; and (iii) inspired by Einstein's eclipse experiment, we probe deflection under controlled context edits, demonstrating measurable, meaning-consistent bends in embedding trajectories that confirm attention-induced curvature.
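To see why a dimensionality baseline matters for claim (ii), consider a rough null model. This is not the authors' simulation, whose details are not given here; it is a minimal sketch assuming each layer-to-layer step is an independent isotropic Gaussian vector. In that setting, consecutive steps in high dimension are almost orthogonal, so turning angles cluster tightly around $90^\circ$, and an excess of sharp or flat angles in a real model would stand out against that baseline. The function names and the choice of 768 dimensions (matching the hidden size of bert-base-uncased) are illustrative assumptions.

```python
import math
import random


def random_walk_turning_angles(dim, n_steps, rng):
    """Turning angles (degrees) between consecutive i.i.d. Gaussian steps.

    ASSUMPTION: steps are independent and isotropic, which is a deliberately
    structure-free null model and not a claim about real Transformer layers.
    """
    steps = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_steps)]
    angles = []
    for u, v in zip(steps, steps[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        cos = max(-1.0, min(1.0, dot / (nu * nv)))
        angles.append(math.degrees(math.acos(cos)))
    return angles


def summarize(dim, n_walks=50, n_steps=12, seed=0):
    """Mean and standard deviation of turning angles over many random walks."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_walks):
        angles.extend(random_walk_turning_angles(dim, n_steps, rng))
    mean = sum(angles) / len(angles)
    var = sum((a - mean) ** 2 for a in angles) / len(angles)
    return mean, math.sqrt(var)


for dim in (3, 768):
    mean, std = summarize(dim)
    print(f"dim={dim}: mean angle {mean:.1f} deg, std {std:.1f} deg")
```

In this null model the spread of angles shrinks roughly like $1/\sqrt{d}$ radians as the dimension $d$ grows, so the low-dimensional walk shows a wide range of angles while the 768-dimensional walk stays within a few degrees of orthogonal.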

Paper Structure

This paper contains 48 sections, 53 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Graphical analogy between spacetime curvature and embedding curvature. On the right, spacetime curvature evolves over time according to the distribution of mass and energy. On the left, embedding curvature evolves across layers according to the learned weights.
  • Figure 2: 2D curvature landscape for the same paragraph at six consecutive layers. Token positions are projected onto a PCA plane, with color encoding turning angle: blue areas indicate straighter motion ($<90^\circ$), and red areas indicate sharper bending ($>90^\circ$). This heatmap highlights regions of high contextual curvature where embeddings are more actively reshaped by attention.
  • Figure 3: Foliated heatmap visualization of token embeddings across Transformer layers. Each wavy sheet corresponds to one layer, colored by the local turning angle (blue: $<90^\circ$, red: $>90^\circ$). The vertical axis represents discrete "semantic time" as the paragraph is processed layer by layer. The trajectory of tokens can be traced across sheets, illustrating how local curvature evolves throughout the model.
  • Figure 4: Analogy between relativistic curvature and embedding curvature. (a) Without curvature, light travels in a straight line. (b) Near the Sun, spacetime curvature bends light, producing an apparent shift. (c) In language models, the token “BANK” follows a curved trajectory in representation space depending on its semantic context.
  • Figure 5: Trajectory divergence results for bert-base-uncased. Each subplot reports one of the four divergence metrics computed across 50 sentence triples. Boxplots show pairwise comparisons between sentence variants (with vs. without, without vs. base, with vs. base); higher values denote stronger divergence between embedding trajectories.