Group Representational Position Encoding

Yifan Zhang; Zixiang Chen; Yifeng Liu; Zhen Qin; Huizhuo Yuan; Kangping Xu; Yang Yuan; Quanquan Gu; Andrew Chi-Chih Yao

TL;DR

GRAPE introduces a unifying group-theoretic framework for positional encoding in transformers, bridging multiplicative rotations and additive biases. It decomposes the approach into Multiplicative GRAPE (norm-preserving SO(d) rotations via rank-2 generators) and Additive GRAPE (unipotent actions in GL with homogeneous lifts), showing RoPE and ALiBi (and FoX) as exact instances. The framework enables multi-subspace extensions (GRAPE-M) and path-integral additive biases (GRAPE-AP), all with exact relative laws and efficient streaming. Empirical results on large language modeling demonstrate GRAPE variants outperforming RoPE and FoX while offering improved stability and scalability.
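To make the additive side concrete, the sketch below shows one way a unipotent action on a homogeneous lift can produce an ALiBi-style linear bias. It is an illustrative construction under stated assumptions, not necessarily the paper's exact lift: queries and keys are padded with two extra coordinates, and the generator N has a single nonzero entry, so N @ N = 0 and exp(nN) = I + nN. The names `unipotent`, `lifted_logit`, and `slope` are hypothetical.

```python
def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def unipotent(n, slope, d):
    """G(n) = exp(n N) = I + n N, where N has one nonzero entry (so N @ N = 0).

    Assumption: N places `slope` at position (d, d + 1) of the lifted space.
    """
    G = identity(d + 2)
    G[d][d + 1] = n * slope
    return G

def lifted_logit(q, k, i, j, slope):
    """Logit between a query at position i and a key at position j."""
    d = len(q)
    q_hat = q + [1.0, 0.0]  # homogeneous lift of the query (assumed form)
    k_hat = k + [0.0, 1.0]  # homogeneous lift of the key (assumed form)
    # Relative law: G(i)^{-1} G(j) = G(-i) G(j) = G(j - i).
    G_rel = matmul(unipotent(-i, slope, d), unipotent(j, slope, d))
    Gk = [sum(g * x for g, x in zip(row, k_hat)) for row in G_rel]
    return sum(a * b for a, b in zip(q_hat, Gk))

q, k = [1.0, 2.0], [0.5, -1.0]
score = lifted_logit(q, k, i=7, j=3, slope=0.25)
# score equals q.k + slope * (j - i): the content term plus a linear distance bias.
```

For causal attention (j ≤ i) the extra term is -slope * (i - j), which has the form of the ALiBi penalty. Because only the offset j - i enters, the bias depends on relative position alone.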

Abstract

We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
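The closed-form exponential of a rank-2 skew generator can be illustrated with a short standard-library sketch. It assumes the generator is written as L = a b^T - b a^T for an orthonormal pair (a, b), in which case L^3 = -L and exp(tL) = I + sin(t) L + (1 - cos(t)) L^2. This parameterization, the helper names, and the numeric values are assumptions chosen for illustration rather than the paper's implementation.

```python
import math

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]

def rank2_generator(a, b):
    """L = a b^T - b a^T for orthonormal a, b: skew-symmetric, rank 2."""
    ab, ba = outer(a, b), outer(b, a)
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(ab, ba)]

def grape_rotation(n, omega, L):
    """Closed form of exp(n * omega * L) for a rank-2 skew generator L."""
    t = n * omega
    d = len(L)
    L2 = matmul(L, L)
    I = identity(d)
    return [[I[r][c] + math.sin(t) * L[r][c] + (1.0 - math.cos(t)) * L2[r][c]
             for c in range(d)] for r in range(d)]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Orthonormal pair spanning a non-axis-aligned plane in R^4 (hypothetical values).
a = [0.5, 0.5, 0.5, 0.5]
b = [0.5, -0.5, 0.5, -0.5]
L = rank2_generator(a, b)
omega = 0.3

def G(n):
    return grape_rotation(n, omega, L)

# Relative law: G(i)^T G(j) = G(j - i), so the logit depends only on the offset.
relative_err = max_abs_diff(matmul(transpose(G(5)), G(12)), G(7))
# Norm preservation: G(n)^T G(n) = I.
orth_err = max_abs_diff(matmul(transpose(G(5)), G(5)), identity(4))
```

Both `relative_err` and `orth_err` should be at floating-point rounding level. Choosing a and b as a canonical coordinate pair reduces `grape_rotation` to a plain 2D rotation in that plane, which is the per-plane building block of RoPE.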
