
GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

Yupu Yao, Bowen Yang

TL;DR

GeoPE introduces a unified 3D geometric positional embedding for structured tensors. It lifts 2D spatial coordinates into quaternion rotations and combines them with a symmetric log-exp average in the Lie algebra, which sidesteps the non-commutativity of 3D rotations. This coupling restores true 2D spatial topology, enabling more global and geometrically meaningful attention patterns. The method extends to 3D inputs and includes a linear variant for relative encoding, achieving gains across image classification, object detection, and 3D segmentation, while also enhancing shape bias. The work demonstrates that explicit geometric priors can improve spatial reasoning and extrapolation without significantly increasing asymptotic complexity.

Abstract

Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
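The construction summarized above (per-axis quaternion rotations, averaged in the Lie algebra, then applied through a sandwich product) can be illustrated with a minimal sketch. Several details here are assumptions rather than the paper's specification: the height rotation is taken about the x-axis and the width rotation about the y-axis, a single base angle `theta` is used instead of a frequency schedule, and the function names (`geope_rotor`, `rotate`) are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_conj(q):
    """Conjugate; for a unit quaternion this is also its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def quat_exp_pure(v):
    """Exponential map: pure quaternion (0, v) -> unit quaternion."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(norm) / norm
    return (math.cos(norm), v[0] * s, v[1] * s, v[2] * s)

def geope_rotor(m, n, theta):
    """Symmetric rotor for grid position (m, n).

    Assumption: height rotates about x, width about y. The log of a
    rotation by angle a about a unit axis u is the pure quaternion (a/2) u.
    """
    log_h = (0.5 * m * theta, 0.0, 0.0)
    log_w = (0.0, 0.5 * n * theta, 0.0)
    # Averaging in the Lie algebra is order-independent, unlike r_h * r_w.
    mean = tuple(0.5 * (a + b) for a, b in zip(log_h, log_w))
    return quat_exp_pure(mean)

def rotate(p, r):
    """Sandwich product p' = r p r* applied to a 3-vector p."""
    q = quat_mul(quat_mul(r, (0.0, *p)), quat_conj(r))
    return q[1:]

# Rotate one feature sub-vector (x1, x2, x3) for the patch at (m, n) = (3, 5).
rotated = rotate((1.0, 2.0, 3.0), geope_rotor(3, 5, 0.1))
```

Because the two logarithms are added before exponentiation, swapping the roles of height and width cannot change the result through multiplication order, which is the symmetry the abstract refers to. A direct product of the two per-axis quaternions would depend on which factor comes first.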

Paper Structure

This paper contains 27 sections, 63 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Geometric Transform of Geometric Positional Embedding (GeoPE). This figure illustrates how GeoPE encodes $2D$ positions (e.g., $(m, n)$) by extending Rotary Positional Embedding (RoPE) to $3D$ space using quaternions. For each feature sub-vector $(x_1, x_2, x_3)$, GeoPE calculates the geometric mean of the height and width rotations in the Lie algebra to create a unified, symmetric rotation operator. This operator then applies a geometrically coupled $3D$ rotation to the query/key sub-vector via a sandwich product ($\mathbf{p}' = \mathbf{r} \mathbf{p} \mathbf{r}^*$) to inject the positional bias.
  • Figure 2: Illustration of mathematical structure and coordinate transform.
  • Figure 3: Mean attention distance as a function of layer depth across different input resolutions. The distance is computed as the average over attention scores, where query–key spatial distances are weighted by their corresponding attention weights and then normalized. While all methods exhibit an expanding receptive field in deeper layers, APE's consistently higher distance suggests an inefficient and unfocused global search. In contrast, GeoPE maintains a more moderate distance, indicating a more structured and efficient strategy for balancing local and global information gathering. These relative trends remain consistent across all tested resolutions.
  • Figure 4: Attention Map Visualization. This figure compares the self-attention patterns from the final layer of ViT-Base models, evaluated after pre-training from scratch on ImageNet-1K. The heatmaps visualize the cosine similarity between patch representations, averaged across all attention heads, where the fine-grained patterns within the large squares reflect the feature correlation and similarity among the pixels inside each input patch. APE results in highly localized attention focused on the diagonal. RoPE-mixed shows a more distributed local pattern. In contrast, GeoPE facilitates complex, long-range attention, indicating a significantly more global receptive field. GeoPE's global attention pattern demonstrates its improved ability to integrate features across the entire image based on geometric structure.
  • ...and 1 more figures
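
The mean attention distance described in the Figure 3 caption (query–key spatial distances weighted by attention and then normalized) can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: it assumes a single head, row-major patch ordering, Euclidean distance measured in patch units, and no class token; the function name `mean_attention_distance` is hypothetical.

```python
import math

def mean_attention_distance(attn, height, width):
    """Attention-weighted average of query-key distances on a patch grid.

    attn: list of rows, attn[q][k] is the weight query q places on key k.
    Patches are assumed to be indexed row-major on a height x width grid.
    """
    coords = [(i // width, i % width) for i in range(height * width)]
    weighted = 0.0
    total = 0.0
    for q, row in enumerate(attn):
        for k, w in enumerate(row):
            dy = coords[q][0] - coords[k][0]
            dx = coords[q][1] - coords[k][1]
            weighted += w * math.hypot(dy, dx)
            total += w
    # Normalize by total attention mass so the result is a distance.
    return weighted / total

# Two patches side by side with uniform attention: half the mass
# travels a distance of one patch, the other half stays in place.
example = mean_attention_distance([[0.5, 0.5], [0.5, 0.5]], 1, 2)
```

Under this reading, purely diagonal attention gives a distance of zero, while attention spread evenly over the grid approaches the average pairwise patch distance.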