PoPE: Legendre Orthogonal Polynomials Based Position Encoding for Large Language Models

Arpit Aggarwal

TL;DR

The paper addresses the limitations of traditional absolute and relative positional encodings in high-dimensional transformer representations. It introduces PoPE, a Legendre polynomial-based encoding, leveraging orthogonality, non-periodicity, and a three-term recurrence to provide a denser, more informative positional representation that supports learning relative positions. Empirically, PoPE improves Multi30K English-to-German translation BLEU to $40.7$ and accelerates convergence by $2$–$3\times$ relative to sinusoidal baselines, with theoretical arguments explaining the improved representation and bias reduction. This approach offers a principled, scalable alternative for positional encoding with potential applicability beyond the original Transformer, contributing to more efficient training and better generalization of long-range dependencies.
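For reference, the two properties named above can be stated in their standard textbook form. These are the usual definitions of the Legendre polynomials $P_n$ on $[-1, 1]$, not notation taken from the paper. The three-term (Bonnet) recurrence is

$$(n+1)\,P_{n+1}(x) = (2n+1)\,x\,P_n(x) - n\,P_{n-1}(x), \qquad P_0(x) = 1,\quad P_1(x) = x,$$

and the orthogonality relation is

$$\int_{-1}^{1} P_m(x)\,P_n(x)\,dx = \frac{2}{2n+1}\,\delta_{mn}.$$

For example, setting $n = 1$ in the recurrence gives $P_2(x) = \tfrac{1}{2}(3x^2 - 1)$.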

Abstract

Several improvements have been proposed over the baseline Absolute Positional Encoding (APE) method used in the original Transformer. In this study, we investigate how inadequately representing positional encoding in higher dimensions affects crucial aspects of the attention mechanism, the model's capacity to learn relative positional information, and model convergence, all of which stem from the choice of sinusoidal basis functions. Through a combination of theoretical insights and empirical analyses, we show how these challenges extend beyond APEs and may adversely affect the performance of Relative Positional Encoding (RPE) methods, such as Rotary Positional Encoding (RoPE). We then introduce Orthogonal Polynomial Based Positional Encoding (PoPE) to address some of the limitations of existing methods. PoPE encodes positional information using orthogonal Legendre polynomials. As basis functions, Legendre polynomials offer several desirable properties for positional encoding, including improved correlation structure, non-periodicity, orthogonality, and distinct functional forms among polynomials of different orders. Our experiments show that transformer models incorporating PoPE outperform baseline transformer models on the Multi30k English-to-German translation task, establishing a new performance benchmark. PoPE-based transformers also converge significantly faster. Finally, we present novel theoretical perspectives on position encoding based on the superior performance of PoPE.
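To make the idea of a Legendre-based encoding concrete, here is a minimal sketch of how such a matrix could be built. It is an illustration under stated assumptions, not the paper's formulation: the function name `legendre_encoding`, the linear mapping of token positions onto $[-1, 1]$, and the choice of one polynomial order per embedding dimension are all hypothetical. The polynomial values are computed with the standard three-term recurrence.

```python
import numpy as np


def legendre_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Return a (num_positions, dim) matrix whose column n holds P_n(x).

    Hypothetical helper: the position-to-x mapping below is an assumption,
    not taken from the paper.
    """
    # Assumption: token positions are spread linearly over [-1, 1],
    # the interval on which Legendre polynomials are orthogonal.
    x = np.linspace(-1.0, 1.0, num_positions)

    enc = np.empty((num_positions, dim))
    enc[:, 0] = 1.0          # P_0(x) = 1
    if dim > 1:
        enc[:, 1] = x        # P_1(x) = x

    # Three-term recurrence:
    # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
    for n in range(1, dim - 1):
        enc[:, n + 1] = ((2 * n + 1) * x * enc[:, n] - n * enc[:, n - 1]) / (n + 1)
    return enc


# Example: 50 token positions, 8 encoding dimensions.
pe = legendre_encoding(num_positions=50, dim=8)
```

Each column of `pe` is a different polynomial order, so the columns have distinct, non-periodic shapes. In a real model the matrix would still need to be scaled and combined with token embeddings, which this sketch does not cover.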

Paper Structure (16 sections, 16 equations, 4 figures, 1 table)

Introduction
Prelude to PoPE
Background and Framework
Empirical Analysis
Mathematical Analysis
Proposed Method: PoPE
Legendre Polynomials
Proposed Formulation
Properties of PoPE
Experiments and Results
Data and task
Model and Hardware
Results
Discussion
Limitations and Future Scope
...and 1 more sections

Figures (4)

Figure 1: Low variance in the higher dimensions of sinusoidal positional encoding (a), and near-perfect correlation among the encodings of different token positions (b)
Figure 2: The first four Legendre polynomials, credit: MathWorld
Figure 3: PoPE gives a much better representation (a), and correlation among token positions is much better managed (b)
Figure 4: Training loss convergence with and without PoPE
