Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang

TL;DR

This work tackles the scarcity and inefficiency of long-context data for instruction-tuning open LLMs by introducing a hierarchical, QA-based synthetic data pipeline that generates coherent long-context (up to $1{,}000{,}000$ tokens) instruction data from short-context models and multiple documents. It employs stepwise RoPE scaling to progressively widen the context window and demonstrates state-of-the-art ultra-long-context performance on RULER and InfiniteBench, while preserving general task abilities on LongBench and MMLU. Extensive ablations validate the importance of hierarchical structure, diverse and multi-hop questions, and fixed question counts, and robustness experiments show the method works across generator sizes. The approach is highly scalable, enabling practical training of ultra-long-context LLMs and offering a data-centric path toward longer horizons in real-world applications.

Abstract

Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.

Paper Structure

This paper contains 16 sections, 5 figures, 11 tables, 2 algorithms.

Figures (5)

  • Figure 1: High-level overview of our approach to automatically generating QA pairs for long-context documents. (1) In the first step, we split a document into small and medium chunks, which are then (2) summarized by an off-the-shelf LLM requiring only smaller context windows. In (3) we sample summaries at different localities in a hierarchical manner, balancing local and global views of the original document. In (4) we generate questions based on the sampled summaries. In the right panel, we show a subset of the prompts used to generate diverse and complex questions, given the sampled summaries.
  • Figure 2: High-level overview of our approach to generating order-following QAs. (1) Input a raw long-context document. (2) Split the document into small, medium, and global chunks, and generate summaries at each level. (3) The first QA is based on the global summary. (4) We randomly select a medium chunk to generate a QA, (5) then delve deeper by selecting a small chunk within it for another QA. (6) To continue, the process alternates between exploiting the same small chunk and exploring new medium or small chunks to generate further QAs.
  • Figure 3: High-level overview of our approach to curating long-context data using multiple documents. (1) Diverse and hierarchical QAs are generated at different levels of granularity for each document. (2) $N$ hierarchical and diverse QAs are sampled and extracted from each document. (3) QAs from different documents are combined, maintaining a balance of hierarchical and diverse questions across the entire set. $N = 5$ in our algorithm; when we revisit previous documents in step (3), we sample 3 hierarchical questions for each document with $60\%$ probability, as well as 9 diverse questions in total from all previous documents.
  • Figure 4: Effective context length up to 1M tokens using Qwen-2-72B-Instruct as generator on RULER.
  • Figure 5: Effective context length using Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct as generators on RULER.
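
The order-following procedure described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `split_hierarchy` and `qa_schedule`, the chunk sizes, the `exploit_prob` parameter, and the even split between the two kinds of exploration are all hypothetical. The sketch only decides which chunk each QA should target; summarization and question generation by an LLM are left out.

```python
import random


def split_hierarchy(tokens, medium_size, small_size):
    """Split a token list into medium chunks, each split into small chunks.

    Chunk sizes are illustrative assumptions, not values from the paper.
    """
    mediums = [tokens[i:i + medium_size] for i in range(0, len(tokens), medium_size)]
    return [
        [m[j:j + small_size] for j in range(0, len(m), small_size)]
        for m in mediums
    ]


def qa_schedule(hierarchy, num_qas, exploit_prob=0.5, seed=0):
    """Return QA targets as (level, medium_idx, small_idx) tuples.

    The order follows the caption: a global QA first, then a random medium
    chunk, then a small chunk inside it, then alternating between exploiting
    the same small chunk and exploring new chunks. `exploit_prob` and the
    50/50 choice between exploration modes are assumptions.
    """
    if not hierarchy:
        raise ValueError("hierarchy must contain at least one medium chunk")
    rng = random.Random(seed)

    # Steps (3)-(5): global QA, random medium chunk, small chunk within it.
    m = rng.randrange(len(hierarchy))
    s = rng.randrange(len(hierarchy[m]))
    schedule = [("global", None, None), ("medium", m, None), ("small", m, s)]

    # Step (6): exploit the current small chunk or explore elsewhere.
    while len(schedule) < num_qas:
        if rng.random() < exploit_prob:
            schedule.append(("small", m, s))
        elif rng.random() < 0.5 and len(hierarchy[m]) > 1:
            # Explore a different small chunk inside the current medium chunk.
            s = rng.choice([i for i in range(len(hierarchy[m])) if i != s])
            schedule.append(("small", m, s))
        else:
            # Explore a new medium chunk, then drill into one of its small chunks.
            m = rng.randrange(len(hierarchy))
            s = rng.randrange(len(hierarchy[m]))
            schedule.append(("medium", m, None))
            schedule.append(("small", m, s))

    return schedule[:num_qas]


# Usage: a toy "document" of 100 token ids, with assumed chunk sizes.
hierarchy = split_hierarchy(list(range(100)), medium_size=40, small_size=10)
for target in qa_schedule(hierarchy, num_qas=8):
    print(target)
```

Each tuple names the summary level a question would be generated from, so a downstream step could look up the matching summary and prompt a generator model with it. Fixing `seed` keeps the schedule reproducible across runs.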