Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

TL;DR

The paper tackles the lack of a principled understanding of positional encoding (PE) in transformers and its impact on context length extrapolation. By recasting self-attention as a Bayesian mechanism, BAM treats PE as a prior over token positions, unifying NoPE, ALiBi, and related approaches under a single probabilistic framework. The authors introduce a Generalized Gaussian prior (GGD-BAM) that, with a small parameter count, substantially improves long-context retrieval while maintaining competitive perplexity; the results are validated on passkey retrieval and downstream benchmarks. The work provides both theoretical insights and practical methods for learning and visualizing attention priors, is compatible with Scalable Softmax (SSMax), and offers a path to more robust and interpretable long-context transformers.

Abstract

Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
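One way to read the "positional encoding as a prior" idea is through the softmax itself: adding a position-dependent bias to the attention logits is the same as multiplying the content-only attention weights by the exponential of that bias and renormalizing, so the bias behaves like a log-prior over key positions. The NumPy sketch below illustrates this under stated assumptions. The function names (`log_prior_ggd`, `attention_with_prior`), the parameter names (`alpha`, `beta`, `mu`), and the specific unnormalized form $-(|i-j-\mu|/\alpha)^{\beta}$ are illustrative choices based on the standard generalized Gaussian density, not the paper's exact parameterization.

```python
import numpy as np

def log_prior_ggd(n, alpha, beta, mu=0.0):
    """Unnormalized log of a generalized Gaussian over the distance i - j.

    Hypothetical parameterization: alpha is a scale, beta a shape, mu a shift.
    With beta = 1 and mu = 0 this is a linear distance penalty (ALiBi-style,
    slope 1 / alpha). An all-zero prior corresponds to no positional bias.
    """
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    dist = (i - j).astype(float)
    return -(np.abs(dist - mu) / alpha) ** beta

def attention_with_prior(q, k, log_prior):
    """Causal softmax attention weights with an additive log-prior on the logits."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d) + log_prior
    causal = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(causal, logits, -np.inf)      # mask out future keys
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))

uniform = np.zeros((6, 6))                        # no positional bias
linear = log_prior_ggd(6, alpha=2.0, beta=1.0)    # ALiBi-like linear penalty

w_content = attention_with_prior(q, k, uniform)
w_biased = attention_with_prior(q, k, linear)

# Bayes-style reading: biased weights are proportional to content weights times exp(log-prior).
posterior = w_content * np.exp(linear)
posterior /= posterior.sum(axis=-1, keepdims=True)
```

In this sketch `posterior` and `w_biased` agree up to floating-point error, which is the sense in which an additive bias can be viewed as a prior. Changing `beta` alters how sharply the prior decays with distance; how the paper learns or constrains such parameters is not specified here.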

Paper Structure

This paper contains 62 sections, 52 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Visual comparison of different positional priors $p(g_{\text{pos}}(i,j))$ in BAM. Each curve represents the distribution over past token positions for a fixed query $\mathbf{q}_{i}$ at a fixed token position $i$.
  • Figure 2: Visual representation of the scoring function in GGD-BAM. The first matrix accounts for the content and the two others for the Uniform and GGD positional priors.
  • Figure 3: Passkey retrieval accuracy with distinct PE methods. BAM SSMax outperforms all PE methods, maintaining perfect accuracy for contexts beyond $64\times$ the training context length.
  • Figure 4: Passkey retrieval accuracy across context lengths and depths. The horizontal axis represents context length and the vertical axis represents the position of the passkey in the context. The bottom row and last column show average accuracy across length and position, respectively.
  • Figure 5: Attention weights from GGD-BAM during the Passkey Retrieval task. When $\beta\leq0$, attention concentrates on distant keys (e.g., the passkey tokens), suppressing nearby content.
  • ...and 13 more figures

Theorems & Definitions (8)

  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof