Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

TL;DR

This work investigates why large language models struggle to retrieve information located in the middle of long inputs. The authors link the phenomenon to a robust U-shaped positional attention bias, showing that early and late input regions attract more attention regardless of content. They propose a calibration method, found-in-the-middle, to disentangle bias from true relevance and demonstrate that calibrated attention improves the model’s ability to locate middle-context information and enhances RAG performance by up to about 15 percentage points across tasks and models. The method is inference-time and can complement reordering-based pipelines, offering a principled way to improve long-context utilization in practical deployments. These findings provide a deeper understanding of attention biases in LLMs and lay groundwork for more reliable long-context reasoning.

Abstract

Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon is known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle and LLMs' intrinsic attention bias: LLMs exhibit a U-shaped attention bias where the tokens at the beginning and at the end of their input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through a calibration mechanism, found-in-the-middle, that allows the model to attend to contexts faithfully according to their relevance, even when they are in the middle. Third, we show that found-in-the-middle not only achieves better performance in locating relevant information within a long context, but also leads to improved retrieval-augmented generation (RAG) performance across various tasks, outperforming existing methods by up to 15 percentage points. These findings open up future directions in understanding LLM attention bias and its potential consequences.
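
To make the idea of separating positional bias from relevance concrete, here is a toy sketch. It assumes, purely for illustration, that the attention a document receives is the sum of a relevance term and a position-only bias term, and that the bias at each position can be estimated separately (for example, from a content-free placeholder document). Neither assumption is spelled out in the abstract, the function names `calibrate_attention` and `rank_positions` are hypothetical, and the numbers are invented, so this should not be read as the authors' implementation.

```python
from typing import List


def calibrate_attention(observed: List[float], baseline: List[float]) -> List[float]:
    """Subtract a per-position bias estimate from observed per-document attention.

    `observed[k]` is the attention mass on the document at position k.
    `baseline[k]` is a hypothetical estimate of the attention that position k
    attracts regardless of content. Both are assumptions of this sketch.
    """
    if len(observed) != len(baseline):
        raise ValueError("observed and baseline must have the same length")
    return [o - b for o, b in zip(observed, baseline)]


def rank_positions(scores: List[float]) -> List[int]:
    """Return document positions sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Invented numbers: a U-shaped baseline and a relevant document at position 2.
observed = [0.30, 0.10, 0.16, 0.09, 0.35]
baseline = [0.28, 0.09, 0.08, 0.08, 0.33]

raw_ranking = rank_positions(observed)
calibrated_ranking = rank_positions(calibrate_attention(observed, baseline))
```

With these made-up values, the raw attention ranks the last document first, while the bias-corrected scores rank the middle document first, which is the qualitative effect the abstract describes.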

Paper Structure (34 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: (a) Lost-in-the-middle refers to models' U-shape RAG performance as the relevant context's (e.g., a gold document containing the answer to a query) position varies within the input; (b) We observe models exhibit U-shape attention weights favoring leading and ending contexts, regardless of their actual contents; (c) Models do attend to relevant contexts even when placed in the middle, but are eventually distracted by leading/ending contexts; (d) We propose a calibration mechanism, found-in-the-middle, that disentangles the effect of U-shape attention bias and allows models to attend to relevant context regardless of their positions.
  • Figure 2: Left and Middle: Qualitatively, the model's response exhibits a strong bias towards the document at the first position (red). This persists whether the input documents retain their original order (left: gold document at the 10th position) or are randomly shuffled (middle: gold document at the 13th position). Model responses are shown in green, with the gold answer highlighted in yellow. Right: Our attention calibration method enables the model to find relevant context even when placed in the middle.
  • Figure 3: Quantitatively, the model's response strongly depends on the document at the first position. This dependence persists even after randomly shuffling the document order, irrespective of its relevance to the query. We measure this dependence by computing the TF-IDF similarity score between the response and each document (gold document originally at position 10).
  • Figure 4: Average attention weights reveal a U-shaped positional bias in the model. Documents at the beginning and end receive greater attention, regardless of order (gold document originally at position 10). Attention is averaged across different decoder layers and attention heads.
  • Figure 5: Attention calibration effectively improves models' context utilization ability, with its performance curves lying almost entirely above standard vanilla attention (in 22 out of 24 cases). In the most challenging settings, where the gold documents are placed in the middle, attention calibration provides improvements of 6 to 15 points. Top/Bottom row: 10/20-doc. Numbers shown in Table \ref{table:rag_table}.
  • ...and 1 more figure
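
The Figure 3 caption describes measuring how much a response depends on each document through a TF-IDF similarity score. A minimal sketch of such a measurement is given below. The whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity, and the function name `response_document_similarity` are all assumptions of this sketch; the caption does not specify which TF-IDF variant or similarity function was used.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per input text."""
    tokenized = [tokenize(t) for t in texts]
    n = len(tokenized)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption): terms in every text keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def response_document_similarity(response: str, documents: List[str]) -> List[float]:
    """Score the response against each document, in document order."""
    vectors = tfidf_vectors(documents + [response])
    response_vec = vectors[-1]
    return [cosine(response_vec, doc_vec) for doc_vec in vectors[:-1]]


# Invented example: three short "documents" and a response echoing the first.
documents = [
    "the capital of france is paris",
    "bananas are yellow fruit",
    "mount everest is the tallest mountain",
]
scores = response_document_similarity("paris is the capital of france", documents)
```

Plotting such scores against document position, before and after shuffling, would give a curve of the kind the caption describes: a peak at the first position indicates that the response tracks whichever document is placed first.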