PaTH Attention: Position Encoding via Accumulating Householder Transformations

Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim

TL;DR

PaTH Attention introduces a data-dependent multiplicative position encoding that accumulates Householder-like transformations along the path between two positions, in contrast to RoPE's fixed, input-agnostic rotations. Each transition matrix has an identity-plus-rank-one structure and is computed from the input, which links the method to expressive linear RNNs while keeping a softmax attention formulation. A FlashAttention-style blockwise implementation enables hardware-efficient training, and the PaTH-FoX extension combines PaTH with forgetting gates for further gains on long contexts. Across synthetic state-tracking tasks and moderate-scale language modeling, PaTH and PaTH-FoX outperform RoPE and other baselines, with improved long-context generalization and state tracking.

Abstract

The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder-like transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
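To make the idea of accumulating transformations along a path concrete, here is a minimal reference sketch in plain Python. It assumes each transition has the identity-plus-rank-one form $H_s = I - \beta_s w_s w_s^\top$ and that the unnormalized attention logit between query position $i$ and key position $j \le i$ is $k_j^\top (H_{j+1} \cdots H_i)\, q_i$. The function names, the exact ordering of the product, and the way $w_s$ and $\beta_s$ would be computed from the input are assumptions made for illustration; the abstract does not spell them out. The loop is quadratic and unoptimized, and is not the blockwise algorithm the paper describes.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def apply_householder_like(w, beta, v):
    """Compute (I - beta * w w^T) v without forming the matrix."""
    scale = beta * dot(w, v)
    return [v_r - scale * w_r for v_r, w_r in zip(v, w)]


def path_logits(queries, keys, ws, betas):
    """Naive causal logits under the assumed form
    logits[i][j] = k_j^T (H_{j+1} ... H_i) q_i  for j <= i.

    queries, keys, ws: lists of n vectors (lists of floats), one per position.
    betas: list of n scalars. In a real model ws and betas would be
    functions of the input; here they are simply passed in.
    Entries with j > i are left as None (masked future positions).
    """
    n = len(queries)
    logits = [[None] * n for _ in range(n)]
    for i in range(n):
        v = list(queries[i])
        # Walk backwards from i to 0, folding one more transition into v
        # after each logit is recorded.
        for j in range(i, -1, -1):
            logits[i][j] = dot(keys[j], v)
            v = apply_householder_like(ws[j], betas[j], v)
    return logits
```

With every `beta` set to zero each transition is the identity and the logits reduce to plain dot products, i.e. attention with no position information. A row-wise softmax over the non-masked entries would then give the attention weights.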

Paper Structure

This paper contains 23 sections, 3 theorems, 23 equations, 4 figures, 5 tables.

Key Result

Theorem 2.1

A one-layer PaTH transformer with two attention heads and $\log n$ precision can solve an $\mathsf{NC}^1$-complete problem under $\mathsf{AC}^0$-reductions.
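For intuition about why products of Householder matrices are useful for state tracking, the sketch below shows that a Householder reflection along $e_a - e_b$ is exactly the matrix that swaps coordinates $a$ and $b$, so a sequence of reflections can track a sequence of swaps. The word problem on permutations of five elements is a well-known $\mathsf{NC}^1$-complete problem; the theorem statement above does not say which problem the proof uses, so this is an illustration only, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def reflect(w, v):
    """Apply the Householder reflection (I - 2 w w^T / (w^T w)) to v."""
    scale = 2.0 * dot(w, v) / dot(w, w)
    return [v_r - scale * w_r for v_r, w_r in zip(v, w)]


def swap_direction(a, b, size):
    """Vector e_a - e_b; reflecting along it swaps coordinates a and b."""
    w = [0.0] * size
    w[a] = 1.0
    w[b] = -1.0
    return w


def track_swaps(swaps, size=5):
    """Apply a sequence of (a, b) swaps to the labels 0..size-1
    using only Householder reflections, and return the final arrangement."""
    state = [float(label) for label in range(size)]
    for a, b in swaps:
        state = reflect(swap_direction(a, b, size), state)
    return [int(x) for x in state]


if __name__ == "__main__":
    # Swap slots 0 and 1, then slots 1 and 2.
    print(track_swaps([(0, 1), (1, 2)]))  # prints [1, 2, 0, 3, 4]
```

Since every permutation is a product of swaps, accumulated reflections of this kind can represent any rearrangement of the five labels. A fixed rotation that depends only on relative position cannot choose which swap to apply at each step.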

Figures (4)

  • Figure 1: FFLM error rate (%) on ID/OOD test sets. All models are 1-layer, 2-head, 64-dim.
  • Figure 2: Length extrapolation results for 760M models trained on 50B tokens with 4096 context length.
  • Figure 3: RULER results grouped by different task categories.
  • Figure 4: BABILong performance breakdowns. QA1: Single supporting fact. QA2: Two supporting facts. QA3: Three supporting facts. QA4: Two arg relations. QA5: Three arg relations.

Theorems & Definitions (5)

  • Theorem 2.1
  • Theorem A.1
  • proof
  • Theorem A.1
  • proof