Attention Sorting Combats Recency Bias In Long Context Language Models

Alexander Peysakhovich, Adam Lerer

TL;DR

The paper investigates why long-context language models struggle to utilize distant information, identifying training-time attention priors and RoPE-induced recency bias as key factors. It introduces SynthWiki, a synthetic long-context QA dataset, to analyze context utilization under distractor pressure while minimizing pretraining contamination. A decode-time technique called attention sorting demonstrates that reordering documents based on initial attention improves extraction accuracy across several open-source and API models, with especially strong gains for models tuned for long-context QA. The work highlights the potential and limits of permutation-based context refinement and points toward more principled integration of retrieval with long-context fine-tuning for real RAG tasks.

Abstract

Current language models often fail to incorporate long contexts efficiently during generation. We show that a major contributor to this issue is a set of attention priors that are likely learned during pre-training: relevant information located earlier in context is attended to less on average. Yet even when models fail to use the information from a relevant document in their response, they still pay preferential attention to that document compared to an irrelevant document at the same position. We leverage this fact to introduce "attention sorting": perform one step of decoding, sort documents by the attention they receive (highest attention going last), repeat the process, and then generate the answer with the newly sorted context. We find that attention sorting improves the performance of long-context models. Our findings highlight some challenges in using off-the-shelf language models for retrieval-augmented generation.
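
The sorting loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `attention_sort` and the `score_fn` callback are hypothetical names, and `score_fn` stands in for whatever runs one decoding step on the model and returns one attention score per document.

```python
from typing import Callable, List, Sequence


def attention_sort(
    docs: Sequence[str],
    score_fn: Callable[[List[str]], List[float]],
    rounds: int = 1,
) -> List[str]:
    """Reorder docs so that the most-attended document ends up last.

    score_fn is a hypothetical callback: given the documents in their
    current context order, it performs one decoding step and returns one
    attention score per document, in the same order.
    """
    ordered = list(docs)
    for _ in range(rounds):
        scores = score_fn(ordered)
        if len(scores) != len(ordered):
            raise ValueError("score_fn must return one score per document")
        # Ascending sort: the highest-attention document goes last.
        order = sorted(range(len(ordered)), key=lambda i: scores[i])
        ordered = [ordered[i] for i in order]
    return ordered
```

The answer would then be generated from the returned ordering. Repeating the step matters because the scores mix relevance with position: a relevant document that starts early in the context may need more than one round to move past distractors that are favored only for being recent.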

Paper Structure (10 sections, 7 figures)

Figures (7)

  • Figure 1: Performance of all long-context models that we study on question answering degrades when the relevant information is embedded in a long context of irrelevant distractor text.
  • Figure 2: QA accuracy on SynthWiki as a function of the position of the relevant document in the context. We see a replication of the 'lost in the middle' effect (Liu et al., 2023) on this dataset, in which accuracy is lower when the relevant information is in the middle of a long context. The recency (information toward the end of the context) effect seems to be quite general across models and context lengths. However, the primacy (first documents) effect seems to be less general.
  • Figure 3: Average attention weight by source token position for different context lengths, averaged over all layers and attention heads. The attention weights are only computed for the first generated response token. At long context lengths, all three models show a strong bias towards attending to the most recent tokens, as well as a weaker bias towards the initial tokens. All models also attend much more strongly to relevant documents than distractor documents.
  • Figure 4: An illustration of the attention sorting procedure. Average per-document attention is computed for the first generated response token, and then documents are sorted in context with the highest attention at the end. After k rounds of this sorting procedure, the response is generated.
  • Figure 5: The effect of attention sorting on SynthWiki. Attention sorting increases small model performance. For the TogetherLlama-Instruct model, re-sorting recovers most of the performance degradation from long context and matches the performance of Claude-2.
  • ...and 2 more figures
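
The per-document attention mentioned in the captions for Figures 3 and 4 could be computed along the following lines. This is one plausible reading, offered as a sketch: it assumes the token-level weights for the first generated token have already been averaged over layers and heads, and the function name `per_document_attention` and the `doc_spans` layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def per_document_attention(
    token_attention: Sequence[float],
    doc_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Average the attention weight over the tokens of each document.

    token_attention: attention on each source token for the first generated
        token (assumed already averaged over layers and heads).
    doc_spans: hypothetical (start, end) token index pairs, end exclusive,
        one pair per document in context order.
    """
    averages = []
    for start, end in doc_spans:
        if not 0 <= start < end <= len(token_attention):
            raise ValueError(f"invalid document span: ({start}, {end})")
        span = token_attention[start:end]
        # Mean rather than sum, so long documents are not favored by length.
        averages.append(sum(span) / len(span))
    return averages
```

Taking the mean instead of the sum is a design choice made here for illustration; the captions only state that average per-document attention is used.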