Table of Contents
Fetching ...

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch

TL;DR

The paper tackles the challenge of long-context instruction tuning for LLMs by analyzing how context affects learning and proposing context synthesis, a data-generation framework that creates tailored background contexts for existing instruction-answer pairs. A controlled pilot study demonstrates that short-context training with targeted context can generalize to longer contexts, while longer-context data yields optimal performance on hard tasks. In real-world tests on LongBench, context synthesis outperforms previous instruction-synthesis methods and comes close to using human-annotated long-context data, with strong generalization to unseen tasks. The work introduces an analytic tool to measure instruction-context coherence, revealing limitations of prior synthesis approaches and guiding more effective data design for long-context models.

Abstract

Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

TL;DR

The paper tackles the challenge of long-context instruction tuning for LLMs by analyzing how context affects learning and proposing context synthesis, a data-generation framework that creates tailored background contexts for existing instruction-answer pairs. A controlled pilot study demonstrates that short-context training with targeted context can generalize to longer contexts, while longer-context data yields optimal performance on hard tasks. In real-world tests on LongBench, context synthesis outperforms previous instruction-synthesis methods and comes close to using human-annotated long-context data, with strong generalization to unseen tasks. The work introduces an analytic tool to measure instruction-context coherence, revealing limitations of prior synthesis approaches and guiding more effective data design for long-context models.

Abstract

Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.

Paper Structure

This paper contains 45 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Illustration of two long-context instruction data synthesis frameworks: instruction synthesis and context synthesis (ours). The light-colored blocks indicate potentially lower-quality components in the synthesized data samples.
  • Figure 2: Impact of varying instruction tuning configurations on long-context performance. The detailed differences between these configurations is presented in Table \ref{['tab:explanation']}. Test length "$\sim$0k" means test contexts containing only the relevant information (needle) without any additional content.
  • Figure 3: Our prompt template for synthesizing context from instruction-answer pairs. The template takes an instruction-answer pair as input, where <instruction> and <answer> are replaced with the actual instruction and answer text. The system prompt ensures the output follows a consistent format, while the user content guides the LLM to generate context that supports the given instruction-answer pair.
  • Figure 4: In this figure we compare model performance after instruction-tuning contrasting instruction synthesis with our approach of context synthesis. In both cases, we compare tuning without context (diagonal lines) with tuning with context (solid bars). We also illustrate the gap between synthesized data and oracle human-annotated data (red dotted line). Experiments are conducted with LLaMA3.1-8B.
  • Figure 5: In the left panel, we present a task-wise performance comparison of different synthesis strategies. In the right panel, we display the context length distribution of different synthesis strategies against test sets across different tasks.
  • ...and 3 more figures