What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations
Mihir Gupte, Eshan Dixit, Muhammad Tayyab, Arun Adiththan
TL;DR
This work investigates the lost-in-the-middle problem in LLMs using GM-Extract, a benchmark with 92 control variables and two downstream tasks (Key-Value Extraction and Question-Answering). It introduces two diagnostic metrics, the Document Metric for spatial retrieval and the Variable Extraction Metric for semantic retrieval, and systematically evaluates white-box and black-box mitigation methods across models with varying context windows. Key findings show that data representation strongly influences retrieval and that perplexity correlates with performance. Long-context pretraining improves short-context performance, but spatial awareness degrades as the context fills, and mitigation methods offer only modest, highly conditional gains. The study underscores fundamental limits tied to attention distribution and highlights the need for pretraining curriculum adjustments and data-format considerations to make long-context retrieval robust in real-world applications.
Abstract
The diminishing ability of large language models (LLMs) to effectively utilize long-range context, known as the "lost-in-the-middle" phenomenon, poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset designed to evaluate LLM performance on the retrieval of control variables. To diagnose failure modes accurately, we propose a simple evaluation system built on two distinct metrics: one for spatial retrieval capability (Document Metric) and one for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply from altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we survey the literature on mitigation methods, which we categorize into two approaches: black-box and white-box methods. We then apply these techniques to our benchmark and find that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies improve performance, as well as surprising cases where they have a negative impact, giving a clearer picture of their utility in a practical context.
