WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale

Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei

TL;DR

WildLong tackles the scarcity of high-quality long-context instruction data by grounding synthesis in realistic user interactions, constructing document-type-specific meta-information graphs, and adaptively generating instruction–response pairs. The two-stage pipeline combines graph-guided path sampling with GPT-4-driven instruction adaptation to produce diverse data that also covers multi-document tasks, enabling scalable long-context tuning. Fine-tuning Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct on 150K synthetic pairs yields substantial gains on long-context benchmarks (e.g., +14.7 on RULER for Mistral; 84.1 on RULER for Llama-3.1-8B) while preserving short-context performance without mixing in short-context data. Ablation studies show that graph-based generation, multi-document data, and RoPE scaling interact to enhance performance, establishing WildLong as a practical paradigm for broad, realistic long-context reasoning in LLMs.

Abstract

Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, fine-tuned on 150K instruction-response pairs synthesized using WildLong, surpass existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.

Paper Structure

This paper contains 27 sections, 3 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Overview of the two-stage WildLong Framework. Stage 1 extracts meta-information from real-world user-chatbot conversations, classifies documents by type, constructs graphs to represent meta-information relationships, and samples paths to generate tailored instructions. Stage 2 pairs long documents from the pre-training corpus with these instructions, generating instruction-response pairs by rewriting the instructions and answering based on the document context.
  • Figure 2: Examples of instructions generated from sampled paths in a narrative text graph. Solid lines represent connections within paths, while dotted lines show node interconnections established during graph construction. A random walk algorithm combines nodes into paths, and each path yields a distinct instruction. For instance, the knowledge node "understanding of narrative structure" and the context node "participation in a creative storytelling exercise" appear in multiple paths but result in distinct instructions because the other meta-information on each path differs.
  • Figure 3: Distribution of document types (inner circle) and task types (outer circle) in our dataset.
  • Figure 4: Comparison of short-context performance between the fine-tuned models and the baseline models. Models fine-tuned with our method preserve short-context capabilities.
  • Figure 5: Short-context and long-context performance of variations of Mistral models.
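
The graph construction and path sampling described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`knowledge`, `context`, `task`), the two task values, the co-occurrence weighting, and the rule that a path visits at most one node per field are all assumptions made for illustration. Only the two quoted node values come from the Figure 2 caption.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(records):
    """Build a weighted undirected graph from meta-information records.

    Each node is a (field, value) pair; an edge weight counts how many
    records contain both endpoints. (Assumed weighting scheme.)
    """
    graph = defaultdict(lambda: defaultdict(int))
    for record in records:
        for a, b in combinations(list(record.items()), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_path(graph, start, length, rng):
    """Weighted random walk visiting at most one node per field (assumption)."""
    path = [start]
    used_fields = {start[0]}
    current = start
    while len(path) < length:
        # Only neighbours whose field is not yet on the path are eligible.
        candidates = [
            (node, weight)
            for node, weight in graph[current].items()
            if node[0] not in used_fields
        ]
        if not candidates:
            break
        nodes, weights = zip(*candidates)
        current = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(current)
        used_fields.add(current[0])
    return path


# Hypothetical records; the "task" values are invented for this example.
records = [
    {
        "knowledge": "understanding of narrative structure",
        "context": "participation in a creative storytelling exercise",
        "task": "plot analysis",
    },
    {
        "knowledge": "understanding of narrative structure",
        "context": "participation in a creative storytelling exercise",
        "task": "character comparison",
    },
]

graph = build_cooccurrence_graph(records)
start = ("knowledge", "understanding of narrative structure")
path = sample_path(graph, start, length=3, rng=random.Random(0))
for field, value in path:
    print(f"{field}: {value}")
```

In this sketch the same knowledge and context nodes can end up on paths with different task nodes, which mirrors the caption's point that shared nodes still lead to distinct instructions. In the paper's pipeline each sampled path would then be passed to an LLM to write the instruction; that step is omitted here.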