Attention Sorting Combats Recency Bias In Long Context Language Models
Alexander Peysakhovich, Adam Lerer
TL;DR
The paper investigates why long-context language models struggle to utilize distant information, identifying training-time attention priors and RoPE-induced recency bias as key factors. It introduces SynthWiki, a synthetic long-context QA dataset, to analyze context utilization under distractor pressure while minimizing pretraining contamination. A decode-time technique called attention sorting demonstrates that reordering documents based on initial attention improves extraction accuracy across several open-source and API models, with especially strong gains for models tuned for long-context QA. The work highlights the potential and limits of permutation-based context refinement and points toward more principled integration of retrieval with long-context fine-tuning for real RAG tasks.
Abstract
Current language models often fail to incorporate long contexts efficiently during generation. We show that major contributors to this issue are attention priors that are likely learned during pre-training: relevant information located earlier in the context is attended to less on average. Yet even when models fail to use the information from a relevant document in their response, they still pay preferential attention to that document compared to an irrelevant document at the same position. We leverage this fact to introduce "attention sorting": perform one step of decoding, sort documents by the attention they receive (highest attention going last), repeat the process, and then generate the answer with the newly sorted context. We find that attention sorting improves the performance of long-context models. Our findings highlight some challenges in using off-the-shelf language models for retrieval-augmented generation.
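
The attention sorting loop described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `attention_sort`, its `rounds` parameter, and the `attention_fn` callback are hypothetical names, and `toy_attention` is a made-up stand-in for a real one-step decode that would return the attention mass each document receives. The toy scorer assumes attention grows with position (a recency prior) and that a relevant document receives a fixed multiplicative boost over an irrelevant one at the same position.

```python
from typing import Callable, List, Sequence


def attention_sort(
    docs: Sequence[str],
    attention_fn: Callable[[List[str]], Sequence[float]],
    rounds: int = 1,
) -> List[str]:
    """Reorder docs so that the most-attended document ends up last.

    attention_fn is a hypothetical callback: given the documents in their
    current order, it returns one attention score per document (for a real
    model, the attention each document receives in one decoding step).
    """
    ordered = list(docs)
    for _ in range(rounds):
        scores = attention_fn(ordered)
        if len(scores) != len(ordered):
            raise ValueError("attention_fn must return one score per document")
        # Ascending sort: lowest attention first, highest attention last.
        order = sorted(range(len(ordered)), key=lambda i: scores[i])
        ordered = [ordered[i] for i in order]
    return ordered


def toy_attention(ordered: List[str]) -> List[float]:
    """Made-up scorer: a recency prior times a boost for the relevant doc."""
    return [
        (pos + 1) * (2.6 if doc.startswith("RELEVANT") else 1.0)
        for pos, doc in enumerate(ordered)
    ]


docs = ["RELEVANT: the answer", "noise a", "noise b", "noise c", "noise d"]
after_one = attention_sort(docs, toy_attention, rounds=1)
after_two = attention_sort(docs, toy_attention, rounds=2)
print(after_one)
print(after_two)
```

In this toy setting a single round moves the relevant document only one position later, because the recency prior still outweighs its boost; the second round moves it to the end of the context. That is the intuition behind repeating the sort before generating the answer.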
