When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

TL;DR

We develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and reduces training time by over 50% compared to standard full attention mechanisms.

Abstract

Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
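
The precision claim in the abstract can be illustrated with a toy sketch. It is not the paper's measurement: it emulates BFloat16 rounding in pure Python and applies a single 2-D RoPE rotation, and the helper names (`to_bf16`, `rotate`, `logit`, `shift_gap`) and the sample vectors are all hypothetical. In exact arithmetic the logit between a query at position m and a key at position n depends only on m - n, so shifting both positions by the same amount should leave it unchanged; rounding intermediate values to BFloat16 can break that invariance.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of a float32: 1 sign bit, 8 exponent
    bits, and 7 explicit fraction bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = ((bits + bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rotate(vec, pos, theta, rnd):
    """Rotate a 2-D vector by pos * theta, rounding every result with rnd."""
    x, y = vec
    c = rnd(math.cos(pos * theta))
    s = rnd(math.sin(pos * theta))
    return (rnd(x * c - y * s), rnd(x * s + y * c))


def logit(q, k, m, n, theta, rnd):
    """Dot product of q rotated to position m and k rotated to position n."""
    qm = rotate(q, m, theta, rnd)
    kn = rotate(k, n, theta, rnd)
    return qm[0] * kn[0] + qm[1] * kn[1]


def shift_gap(q, k, m, n, shift, theta, rnd):
    """Absolute change in the logit when both positions move by `shift`."""
    base = logit(q, k, m, n, theta, rnd)
    moved = logit(q, k, m + shift, n + shift, theta, rnd)
    return abs(base - moved)


def identity(v):
    return v


if __name__ == "__main__":
    q, k = (1.0, 2.0), (0.5, -1.5)  # arbitrary sample vectors
    print("bf16(257) =", to_bf16(257.0))  # only 8 significant bits survive
    print("float64 gap:", shift_gap(q, k, 100, 3, 16, 0.5, identity))
    print("bf16 gap:   ", shift_gap(q, k, 100, 3, 16, 0.5, to_bf16))
```

The rounding helper alone already shows how coarse the format is: with 8 significant bits, odd integers above 256 are not representable, so 257 rounds to 256.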

Paper Structure

This paper contains 31 sections, 10 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Effects of positional shifts on attention computations under different settings. Left: Attention difference $D$ (Eq. \ref{equ:attn_score_diff}) plotted against varying positional shift $\Delta_1$ (with $\Delta_2 = 16$ fixed). Pretrained models under BFloat16 (blue line) exhibit significant discrepancies compared to Float32 (yellow line) and random initialization (green line), indicating that the relative positional encoding property of RoPE is broken under BFloat16 and that pretraining amplifies this effect. Middle: Per-token attention differences between $\Delta_1 = 0$ and $\Delta_2 = 16$, highlighting that the first token accounts for most of the observed attention difference. Right: Attention logit difference (Eq. \ref{equ:attn_logit_diff}) for the first token as sequence length increases, showing larger discrepancies for longer sequences.
  • Figure 2: Illustrations of different attention paradigms. Left: Standard intra-document attention. Middle: Our improved version, intra-document attention with position ID reset per document. Right: AnchorAttention incorporating a shared anchor token, $\mathscr{A}$.
  • Figure 3: Resetting position IDs improves performance, contradicting theoretical predictions of RoPE.
  • Figure 4: RULER performance varies during long-context training; we therefore recommend reporting averaged RULER performance rather than only that of the final training step. PPL remains unchanged after the first several steps and fails to reflect improvements in long-context ability.
  • Figure 5: Illustrations of domain tagging and interleaved chunks. Left: AnchorAttention with domain tagging, where $\mathscr{T}_1$ denotes the domain of document $\textbf{d}_1$. Middle: Intra-document attention with interleaved chunks; documents are split into shuffled, interleaved chunks, preserving the original order within each document. Right: AnchorAttention with interleaved chunks.
  • ...and 3 more figures
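
The three attention paradigms named in the Figure 2 caption can be sketched as boolean masks over a packed training sequence. This is a minimal illustration under stated assumptions, not the authors' implementation (see the repository linked in the abstract for that): the function names are hypothetical, the anchor is assumed to be a single token at index 0 of the packed sequence, and `mask[i][j]` being `True` means token `i` may attend to token `j`.

```python
def intra_doc_mask(doc_lens):
    """Causal mask where each token sees only earlier tokens of its own document."""
    doc_id = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = len(doc_id)
    return [
        [j <= i and doc_id[i] == doc_id[j] for j in range(total)]
        for i in range(total)
    ]


def reset_position_ids(doc_lens):
    """Position IDs that restart at 0 at the beginning of every packed document."""
    return [p for n in doc_lens for p in range(n)]


def anchor_mask(doc_lens):
    """Intra-document causal mask plus one shared anchor token at index 0.

    Assumption: the anchor precedes all documents and belongs to none of
    them (document id -1), yet every token is allowed to attend to it.
    """
    doc_id = [-1] + [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = len(doc_id)
    return [
        [j <= i and (j == 0 or doc_id[i] == doc_id[j]) for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    for row in anchor_mask([1, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

Compared with full causal attention over the whole packed sequence, both masks skip all cross-document entries, which is consistent with the abstract's statement that AnchorAttention reduces unnecessary attention computations.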