Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

TL;DR

GRAPE introduces a unifying group-theoretic framework for positional encoding in transformers, bridging multiplicative rotations and additive biases. It decomposes the approach into Multiplicative GRAPE (norm-preserving SO(d) rotations via rank-2 generators) and Additive GRAPE (unipotent actions in GL with homogeneous lifts), showing RoPE and ALiBi (and FoX) as exact instances. The framework enables multi-subspace extensions (GRAPE-M) and path-integral additive biases (GRAPE-AP), all with exact relative laws and efficient streaming. Empirical results on large language modeling demonstrate GRAPE variants outperforming RoPE and FoX while offering improved stability and scalability.

Abstract

We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
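The closed-form exponential mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the rank-2 skew generator has the form $\mathbf{L}=\mathbf{a}\mathbf{b}^\top-\mathbf{b}\mathbf{a}^\top$; the abstract does not spell this parameterization out, and the function names `rank2_skew_generator` and `grape_rotation` are hypothetical. Under that assumption $\mathbf{L}^3=-s^2\mathbf{L}$ with $s^2=\|\mathbf{a}\|^2\|\mathbf{b}\|^2-(\mathbf{a}^\top\mathbf{b})^2$, so the series collapses to a Rodrigues-type formula.

```python
import numpy as np

def rank2_skew_generator(a, b):
    """Return L = a b^T - b a^T (assumed form of the rank-2 skew generator)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.outer(a, b) - np.outer(b, a)

def grape_rotation(n, omega, a, b):
    """Closed-form exp(n * omega * L) for L = a b^T - b a^T."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    L = rank2_skew_generator(a, b)
    d = L.shape[0]
    # L @ L = -s^2 * P, where P projects onto span{a, b}.
    s_sq = np.dot(a, a) * np.dot(b, b) - np.dot(a, b) ** 2
    if s_sq < 1e-24:
        # a and b are (numerically) parallel, so L = 0 and exp(0) = I.
        return np.eye(d)
    s = np.sqrt(s_sq)
    theta = n * omega * s
    # Rodrigues-type formula: I + sin(theta)/s * L + (1 - cos(theta))/s^2 * L^2.
    return (np.eye(d)
            + (np.sin(theta) / s) * L
            + ((1.0 - np.cos(theta)) / s_sq) * (L @ L))
```

Because `grape_rotation` returns an orthogonal matrix, rotating a query at position `m` and a key at position `n` leaves a logit that depends only on `n - m`, which is the relative law the abstract refers to. The cost here is $O(d^2)$ because the matrix is materialized; applying the same formula directly to a vector needs only inner products with `a` and `b`.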

Paper Structure

This paper contains 31 sections, 9 theorems, 68 equations, 3 figures, 3 tables, 3 algorithms.

Key Result

Corollary 2.1

If $\|\mathbf{a}\|=1$, the rotation angle reduces to $n\omega$. Without normalization, the effective frequency is $\omega_{\mathrm{eff}}=\omega\|\mathbf{a}\|^2$, so the scale of $\mathbf{a}$ can be absorbed into $\omega$.
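A short worked check of where the factor $\|\mathbf{a}\|^2$ comes from, under assumptions that this summary does not state: the generator is taken to be $\mathbf{L}=\mathbf{a}\mathbf{b}^\top-\mathbf{b}\mathbf{a}^\top$ with $\mathbf{b}\perp\mathbf{a}$ and $\|\mathbf{b}\|=\|\mathbf{a}\|$. Then

$$
\mathbf{L}\mathbf{a}=\mathbf{a}(\mathbf{b}^\top\mathbf{a})-\mathbf{b}(\mathbf{a}^\top\mathbf{a})=-\|\mathbf{a}\|^2\,\mathbf{b},
\qquad
\mathbf{L}\mathbf{b}=\mathbf{a}(\mathbf{b}^\top\mathbf{b})-\mathbf{b}(\mathbf{a}^\top\mathbf{b})=\|\mathbf{a}\|^2\,\mathbf{a}.
$$

In the orthonormal basis $(\mathbf{a}/\|\mathbf{a}\|,\ \mathbf{b}/\|\mathbf{a}\|)$ of the plane, $\omega\mathbf{L}$ therefore acts as

$$
\omega\|\mathbf{a}\|^2\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix},
$$

and vanishes on the orthogonal complement, so $\exp(n\,\omega\,\mathbf{L})$ is a planar rotation by angle $n\,\omega\,\|\mathbf{a}\|^2$. Setting $\|\mathbf{a}\|=1$ gives the angle $n\omega$.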

Figures (3)

  • Figure 1: Overview of the GRAPE Framework. We unify positional encodings via group actions $\mathbf{G}(n)=\exp(n\omega\mathbf{L})$. Left: Multiplicative GRAPE recovers RoPE via rank-2 skew generators in $\mathrm{SO}(d)$. Right: Additive GRAPE recovers ALiBi and FoX via low-rank nilpotent generators in the unipotent subgroup of $\mathrm{GL}(d+k)$ ($k = 1$ or $2$).
  • Figure 2: The training and validation loss of medium-size models (355M), with different positional encoding mechanisms on the FineWeb-Edu 100B dataset.
  • Figure 3: The training and validation loss of large-size models (770M), with different positional encoding mechanisms on the FineWeb-Edu 100B dataset.
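The additive side of Figure 1 can be made concrete with a small sketch of one way a nilpotent generator in a $(d+2)$-dimensional homogeneous lift produces an ALiBi-style bias. The specific layout below (which homogeneous coordinates are set to 1, and where the slope `beta` sits in the generator) is an illustrative assumption and not necessarily the construction used in the paper; all function names are hypothetical.

```python
import numpy as np

def alibi_generator(d, beta):
    """Nilpotent generator A of size (d+2) x (d+2) with A @ A = 0 (assumed layout)."""
    A = np.zeros((d + 2, d + 2))
    A[d, d + 1] = beta  # couples the two extra homogeneous coordinates
    return A

def unipotent_action(n, A):
    """exp(n A) equals I + n A exactly, because A @ A = 0."""
    return np.eye(A.shape[0]) + n * A

def lift_query(q):
    """Append homogeneous coordinates (1, 0) to the query."""
    return np.concatenate([q, [1.0, 0.0]])

def lift_key(k):
    """Append homogeneous coordinates (0, 1) to the key."""
    return np.concatenate([k, [0.0, 1.0]])

def additive_logit(q, k, i, j, beta):
    """Logit between a query at position i and a key at position j."""
    A = alibi_generator(q.shape[0], beta)
    # The group is not orthogonal, so the query takes the inverse-transpose
    # action G(i)^{-T} = G(-i)^T while the key takes G(j).
    q_i = unipotent_action(-i, A).T @ lift_query(q)
    k_j = unipotent_action(j, A) @ lift_key(k)
    # q_i @ k_j = q @ k + beta * (j - i), i.e. q @ k - beta * (i - j).
    return q_i @ k_j
```

With this layout the logit is `q @ k - beta * (i - j)`, a linear distance penalty with slope `beta`, and it depends on positions only through `i - j`. Each key is transformed once at its own position, so transformed keys can be cached during streaming decoding.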

Theorems & Definitions (9)

  • Corollary 2.1: Frequency–norm coupling
  • Proposition 3.1: RoPE is a multiplicative GRAPE
  • Lemma I.1: Rank-2 spectrum
  • Corollary I.2: Phase bounds and orthogonality
  • Proposition I.3: Eigenvalues and Jordan structure of additive lifts
  • Lemma I.4: Exact singular-value pair for a canonical rank-1 unipotent
  • Corollary I.5: ALiBi and Additive GRAPE (GRAPE-A) conditioning numbers
  • Lemma I.6: General operator-norm bounds for index-$2$ unipotents
  • Lemma I.7: Orthogonality condition for PaTH factors