Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs

Nikolaus Salvatore, Hao Wang, Qiong Zhang

TL;DR

This work reframes the lost-in-the-middle phenomenon in LLMs as an emergent adaptation to information retrieval demands during pre-training rather than a pure failure. By training GPT-2 and Llama models from scratch on simple human memory paradigms (Free Recall, Running Span) and a Masked Sequence Completion task, the authors show that primacy arises under uniform long-term memory demand, recency under end-weighted short-term demand, and the canonical U-shaped pattern when both demands are present. They demonstrate that architectural biases (autoregressive processing) and attention dynamics (attention sinks) shape these effects, and that ablating sinks selectively disrupts long-term retrieval while leaving short-term performance relatively intact. The findings extend to sequence completion tasks, offering a unified account of positional biases as rational, task-driven adaptations with implications for mitigation and evaluation in LLMs. Overall, the paper links cognitively inspired memory demands, transformer attention mechanics, and model architecture to explain and potentially control lost-in-the-middle effects in large language models.

Abstract

The performance of Large Language Models (LLMs) often degrades when crucial information is in the middle of a long context, a "lost-in-the-middle" phenomenon that mirrors the primacy and recency effects in human memory. We propose that this behavior is not simply a flaw indicative of information loss but an adaptation to different information retrieval demands during pre-training: some tasks require uniform recall across the entire input (a long-term memory demand), while others prioritize the most recent information (a short-term memory demand). Consistent with this view, we show that this U-shaped performance curve emerges when LLMs (GPT-2 and Llama variants) are trained from scratch on two simple human memory paradigms simulating long-term and short-term memory demands. Our analysis reveals that while the recency effect directly aligns with short-term memory demand in the training data, the primacy effect is induced by the uniform long-term memory demand and is additionally influenced by the model's autoregressive properties and the formation of attention sinks. Our main findings from simple human memory paradigms also generalize to a sequence completion task, which more closely resembles the next-token prediction process in LLM pre-training. Together, our findings reveal how information retrieval demands, model architecture, and structural attention dynamics during model training can jointly produce positional bias observed in LLMs.
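
To make the two memory demands described in the abstract concrete, the following is a minimal sketch of how training examples for the two paradigms might be constructed. It is an illustration under stated assumptions, not the authors' data pipeline: the function names, the `<RECALL>` cue for free recall, the word-level vocabulary, and the choice to emit free recall targets in presentation order are all hypothetical. Only the `<RECALL_n>` cue follows the naming used in the paper's figure captions.

```python
import random


def make_free_recall_example(vocab, list_len, rng):
    """Study list plus a recall cue; the target is every studied item.

    Assumption: the target is written in presentation order here, although
    free recall itself allows recall in any order.
    """
    items = rng.sample(vocab, list_len)   # unique items, no repeats
    prompt = items + ["<RECALL>"]         # hypothetical cue token
    target = list(items)                  # uniform demand over all positions
    return prompt, target


def make_running_span_example(vocab, list_len, n_back, rng):
    """Study list plus a <RECALL_n> cue; the target is the last n items."""
    items = rng.sample(vocab, list_len)
    prompt = items + [f"<RECALL_{n_back}>"]
    target = items[-n_back:]              # demand only on the most recent items
    return prompt, target


def demand_by_position(examples, list_len):
    """Count how often each study position appears in a target."""
    counts = [0] * list_len
    for prompt, target in examples:
        studied = prompt[:-1]             # drop the trailing cue token
        for pos, item in enumerate(studied):
            if item in target:
                counts[pos] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = [f"w{i}" for i in range(100)]
    batch = [
        make_free_recall_example(vocab, 6, rng),
        make_running_span_example(vocab, 6, 2, rng),
    ]
    print(demand_by_position(batch, 6))
```

In this sketch, a mixed batch places a flat demand on every position (from free recall) plus an extra demand on the final positions (from running span). The training demand alone is therefore not U-shaped; per the abstract, the primacy side of the curve is attributed to the uniform long-term demand interacting with autoregressive processing and attention sinks.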

Paper Structure

This paper contains 18 sections, 10 equations, 7 figures.

Figures (7)

  • Figure 1: (A) The "lost-in-the-middle" behavior in LLMs, where accuracy drops significantly for information near the center of the context window. (B) Serial position effects in human memory, where items from the beginning (primacy) and end (recency) of a study list are recalled with higher accuracy, producing a characteristic U-shaped curve.
  • Figure 2: Lost-in-the-middle behavior in LLMs arises from adaptations to short-term and long-term memory demands during training. (A) The free recall task involves recalling all items from the presented sequence in any order, which places a long-term memory demand equally across the entire list. (B) The running span task involves recalling the last $N$ items preceding a specified location (i.e., recall token), which places a short-term memory demand on only the most recent information. (C) Our findings reveal that when LLMs are trained jointly on both tasks from scratch, lost-in-the-middle behavior emerges.
  • Figure 3: Recall behavior results for all models across each task experiment. (A-C) Serial position curve, probability of first recall, and conditional response probability for each model on the free recall task. (D-F) Relative-to-end recall probability (i.e., recall probability for positions offset from the <RECALL_n> token), probability of first recall, and conditional response probability for each model on the running span task. (G-I) Serial position curve (free recall response), probability of first recall (free recall response), and relative-to-end recall probability (running span response) when models are trained simultaneously on the free recall and the running span tasks.
  • Figure 4: Free recall behavior for alternative model architectures. (A-C) Free recall behavior for an RNN-based seq2seq model. This is an example of another autoregressive model that exhibits the primacy effect similar to decoder-only LLMs. (D-F) Free recall behavior for T5. This encoder-decoder model exhibits a flat recall curve and a uniform probability of first recall.
  • Figure 5: Attention sink and head ablation behavioral results. (A-C) These attention heatmaps show attention scores for sample heads identified as sinks at various thresholds. At $\epsilon = 0.8$, we see a clear attention sink form and use this threshold for ablation testing. (D-F) Recall behavior curves for each model on each task before and after attention sink head dropout. Both free recall and combined tasks show significant drops in performance, both at the primacy region and across the entire list. (G) Each bar represents the averaged recall accuracy of a model on a given task with or without attention sink dropout. For each pair of model-testing conditions, we perform a paired t-test (for aligned inputs) to determine the significance of the performance difference in the unablated and ablated performance metrics (* : $p < 0.05$, ** : $p < 0.01$, *** : $p < 0.001$, n.s. : not significant).
  • ...and 2 more figures
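
Two of the quantities named in the captions above can be sketched in a few lines: the serial position curve (Figure 3) and the flagging of attention-sink heads at a threshold $\epsilon$ (Figure 5). This is a hedged illustration only. The function names and the nested-list layout `attn[head][query][key]` are hypothetical, and the sink criterion used here (mean attention mass on the first token exceeding $\epsilon$) is an assumption for illustration; the paper's exact definition is not reproduced in this summary.

```python
def serial_position_curve(study_lists, recalls):
    """Recall probability per study position, scored regardless of output order."""
    list_len = len(study_lists[0])
    hits = [0] * list_len
    for studied, recalled in zip(study_lists, recalls):
        recalled_set = set(recalled)
        for pos, item in enumerate(studied):
            if item in recalled_set:
                hits[pos] += 1
    return [h / len(study_lists) for h in hits]


def sink_heads(attn, epsilon=0.8, sink_pos=0):
    """Return indices of heads whose mean attention to `sink_pos` exceeds epsilon.

    attn[h][q][k] is the attention weight of head h from query q to key k.
    Assumption: queries at or before the sink position are skipped, because
    under a causal mask the first query attends to itself with weight 1.
    """
    flagged = []
    for h, head in enumerate(attn):
        rows = head[sink_pos + 1:]
        if not rows:
            continue
        mean_mass = sum(row[sink_pos] for row in rows) / len(rows)
        if mean_mass > epsilon:
            flagged.append(h)
    return flagged


if __name__ == "__main__":
    study = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
    recalls = [["a", "d"], ["e", "x", "h", "g"]]
    print(serial_position_curve(study, recalls))

    attn = [
        [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]],  # sink-like head
        [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]],  # ordinary head
    ]
    print(sink_heads(attn, epsilon=0.8))
```

A curve that is high at the first and last positions and low in between corresponds to the U shape described in Figure 1. Heads flagged by a criterion of this kind would be the candidates for the dropout comparison described in Figure 5.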