Retrieval Quality at Context Limit
Max McKinnon
TL;DR
The paper investigates whether the "Lost in the Middle" (LITM) effect persists in long-context retrieval with Gemini 2.5 Flash, a model with a 1M+ token context window. It embeds 26 needle-in-a-haystack facts in a 1M+-token Friends transcript and tests 26 corresponding Q&A pairs. Recall is near-perfect up to about 70% of the transcript, and failures arise only when the input size exceeds the model's token limit. The paper compares these results with prior LITM findings and discusses architectural or training factors, such as ALiBi, that may mitigate middle-context degradation. It acknowledges that the study is limited to simple factoid Q&A and text-only inputs. The results suggest substantial advances in long-context retrieval for single-factoid questions, which bears on the reliability of long-document Q&A and retrieval-augmented workflows at scale. Generalization to multi-needle, multi-modal, and reasoning tasks remains open for future work.
Abstract
The ability of large language models (LLMs) to recall and retrieve information from long contexts is critical for many real-world applications. Prior work (Liu et al., 2023) reported that LLMs suffer significant drops in retrieval accuracy for facts placed in the middle of large contexts, an effect known as "Lost in the Middle" (LITM). We find that Gemini 2.5 Flash answers needle-in-a-haystack questions with high accuracy regardless of the fact's position in the document, including when the document is nearly at the input context limit. Our results suggest that the "Lost in the Middle" effect is not present for simple factoid Q&A in Gemini 2.5 Flash, indicating substantial improvements in long-context retrieval.
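The needle-in-a-haystack setup described above can be illustrated with a short Python sketch. It is not the paper's code: the function names `embed_needles` and `recall_by_bucket`, the even spacing of needles, and the line-level insertion are all assumptions made for illustration. It shows one way to place facts at known depths in a transcript and then summarize recall by position, which is the kind of measurement needed to check for a "Lost in the Middle" dip.

```python
from typing import List, Optional, Tuple


def embed_needles(
    haystack_lines: List[str], needles: List[str]
) -> Tuple[List[str], List[float]]:
    """Insert needles between haystack lines at evenly spaced depths.

    Hypothetical helper: returns the combined lines plus each needle's
    fractional depth (0.0 = start of the document, 1.0 = end).
    """
    total = len(haystack_lines)
    n = len(needles)
    # Target haystack line index before which each needle is inserted.
    positions = [round((i + 1) * total / (n + 1)) for i in range(n)]

    combined: List[str] = []
    depths: List[float] = []
    next_needle = 0
    for idx, line in enumerate(haystack_lines):
        while next_needle < n and positions[next_needle] == idx:
            combined.append(needles[next_needle])
            depths.append(idx / total)
            next_needle += 1
        combined.append(line)
    # Needles whose target falls past the last line go at the very end.
    while next_needle < n:
        combined.append(needles[next_needle])
        depths.append(1.0)
        next_needle += 1
    return combined, depths


def recall_by_bucket(
    depths: List[float], correct: List[bool], n_buckets: int = 10
) -> List[Optional[float]]:
    """Fraction of correctly answered needles per depth bucket.

    Buckets with no needles are reported as None.
    """
    hits = [0] * n_buckets
    counts = [0] * n_buckets
    for depth, ok in zip(depths, correct):
        bucket = min(int(depth * n_buckets), n_buckets - 1)
        counts[bucket] += 1
        hits[bucket] += int(ok)
    return [h / c if c else None for h, c in zip(hits, counts)]
```

In use, the combined lines would be joined into one prompt, each question sent to the model, and the per-question correctness passed to `recall_by_bucket`. A LITM effect would appear as lower recall in the middle buckets than in the first and last ones.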
