ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

Aleksis Datseris, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

TL;DR

The paper addresses the challenge of extrapolating transformer context length beyond training sequences by introducing Exact Positional Encodings (ExPE) and a quantization-stable variant ExQPE. ExPE encodes exact positional information by overriding the first $l$ embedding dimensions with a position-dependent vector $p_n = S + \theta n$, applied to the Query and Key inputs to preserve compatibility via residual connections. Empirical results show ExPE and ExQPE offer improved extrapolation over sinusoidal encodings and competitive performance with RoPE, while maintaining or reducing computational cost, particularly in longer-context regimes. The work highlights practical benefits for long-context language modeling and suggests directions for scaling, benchmarks, and data to validate extrapolation at industrial scales.
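As a rough illustration of the override step described above, the following pure-Python sketch replaces the first $l$ dimensions of each token vector with $p_n = S + \theta n$ and leaves the remaining dimensions untouched. Several details are assumptions made for the example, not taken from the paper: the function name `expe_override` is hypothetical, $S$ and $\theta$ are treated as per-dimension vectors of length $l$, and positions are counted from $n = 0$.

```python
def expe_override(x, S, theta):
    """Return a copy of x with the first l dims of row n set to S + theta * n.

    x     : list of token vectors (one list of floats per position)
    S     : list of l offsets (assumed to be per-dimension)
    theta : list of l step sizes (assumed to be per-dimension)
    """
    l = len(S)
    out = []
    for n, row in enumerate(x):  # n is the token position, assumed to start at 0
        p_n = [S[i] + theta[i] * n for i in range(l)]
        # Keep the remaining d - l dimensions of the original embedding as-is.
        out.append(p_n + list(row[l:]))
    return out


# Toy input: 4 positions, embedding size 4, l = 2 overridden dimensions.
x = [[9.0, 9.0, 9.0, 9.0] for _ in range(4)]
encoded = expe_override(x, S=[1.0, 2.0], theta=[0.5, 0.25])
print(encoded[3])
```

Following the summary, a result like `encoded` would feed the Query and Key projections. Because the position vector is a simple function of $n$, the same function can be evaluated at positions beyond any training length without a lookup table.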

Abstract

This paper introduces a novel approach to position embeddings in transformer models, named "Exact Positional Embeddings" (ExPE), an absolute positional embedding method that can extrapolate to sequences longer than those it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, and these often struggle to extrapolate to sequences longer than those seen during training. Our proposed method encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model's ability to generalize to longer sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings when tested on sequences longer than those used in training.

Paper Structure

This paper contains 18 sections, 16 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Flowchart showing when ExPE is applied in self-attention.
  • Figure 2: Flowchart showing when ExPE is applied in self-attention. ExPE gets applied to the input. Here, we assume self-attention, i.e., the Query, Key, and Value matrices are identical.