Understanding Synthetic Context Extension via Retrieval Heads

Xinyu Zhao, Fangcong Yin, Greg Durrett

TL;DR

This work investigates synthetic-context extension (SCE) as a post-training strategy to endow LLMs with long-context retrieval and reasoning capabilities. It identifies retrieval heads—specialized attention heads—as the mechanistic link between synthetic data and downstream performance, showing substantial overlap between heads learned on synthetic and real data and a strong correlation between head recall and task success. Through attention knockout and activation patching, the authors demonstrate that retrieval heads are necessary but not sufficient, offering mechanistic explanations for when synthetic data helps and guiding principled synthetic-data design. The findings provide a path toward engineering better synthetic data and understanding how it teaches Transformers to operate over long contexts, with implications for building cost-effective long-context LLMs.

Abstract

Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data have high overlap with retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
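The two comparisons the abstract relies on, overlap between head sets and recall of the heads learned, can be illustrated with a small sketch. It assumes each model is summarized by a (layers × heads) matrix of per-head retrieval scores and that "retrieval heads" are the top-k heads by score. Both the top-k rule and the function names (`cosine_similarity`, `top_k_heads`, `head_recall`) are illustrative assumptions, not the paper's exact procedure, and the score matrices are toy values.

```python
import numpy as np


def cosine_similarity(scores_a, scores_b):
    """Cosine similarity between two flattened (layers x heads) score matrices."""
    a = np.asarray(scores_a, dtype=float).ravel()
    b = np.asarray(scores_b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def top_k_heads(scores, k):
    """Return the k highest-scoring heads as a set of (layer, head) pairs.

    Assumption: a head counts as a retrieval head if it is in the top k.
    """
    scores = np.asarray(scores, dtype=float)
    flat_idx = np.argsort(scores.ravel())[::-1][:k]
    return {
        tuple(int(i) for i in np.unravel_index(idx, scores.shape))
        for idx in flat_idx
    }


def head_recall(real_scores, synthetic_scores, k):
    """Fraction of the real-data top-k heads also found in the synthetic-data top-k."""
    real = top_k_heads(real_scores, k)
    synthetic = top_k_heads(synthetic_scores, k)
    return len(real & synthetic) / len(real)


# Toy score matrices (2 layers x 3 heads); the values are made up.
real = np.array([[0.9, 0.0, 0.1],
                 [0.0, 0.8, 0.0]])
synthetic = np.array([[0.7, 0.0, 0.0],
                      [0.3, 0.0, 0.0]])

similarity = cosine_similarity(real, synthetic)
recall = head_recall(real, synthetic, k=2)
print(f"cosine similarity: {similarity:.3f}, recall@2: {recall:.2f}")
```

In this toy case only one of the two top real-data heads is recovered by the synthetic-data model, so recall is one half even though the score matrices remain fairly similar overall. This is why the two measures are worth reporting separately.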

Paper Structure

This paper contains 47 sections, 2 equations, 11 figures, 19 tables.

Figures (11)

  • Figure 1: We explore synthetic context extension with different forms of synthetic data across multiple tasks. Examples for a two-hop question from MuSiQue (Trivedi et al., 2021) are shown here. A special set of attention heads, retrieval heads (Wu et al., 2024), helps explain the performance gap between fine-tuning on real data and synthetic data.
  • Figure 2: Examples of elements of synthetic datasets for MuSiQue with varying levels of concept expression and context diversity. The needle sentences $f_{i}$ in the context and the entities in them are shown in bold. High concept expression means a more realistic expression of the needle $f_{i}$, and low expression means a more synthetic one, including replacing real entities with symbolic entities or transforming $f_{i}$ into templated sentences. High context diversity means more realistic context surrounding the needles, and low diversity means more synthetic contexts such as repeated, irrelevant padding sentences.
  • Figure 3: Cosine similarity between the retrieval scores on real datasets (R, R) vs. their synthetic versions, and Spearman correlation for each setting. We use multiple limited-relation datasets for MDQA, as described in the appendix on training configuration.
  • Figure 4: Examples of symbolic data construction for MuSiQue and SummHay Citation.
  • Figure 5: Retrieval scores for MDQA, MuSiQue, and Insight scores for SummHay Citation. Top Row: Llama-3-8B-Instruct. Bottom Row: Mistral-7B-Instruct-v0.1. The y-axis indicates the layer index and the x-axis indicates the head index within the layer. We note that retrieval heads are largely found in the last 2/3 layers of the model, as expected according to their involvement in the "final step" of copying the correct answer to the output. By contrast, SummHay Citation insight heads are concentrated in the middle layers, indicative of their intermediate role. Within a single layer, the specific important attention head indices were likely randomly primed during pretraining to be effectively adapted to the target task.
  • ...and 6 more figures