Bootstrap Your Own Context Length

Liang Wang, Nan Yang, Xingxing Zhang, Xiaolong Huang, Furu Wei

TL;DR

The paper addresses the challenge of training long-context LLMs without relying on scarce natural long-context data by bootstrapping from short-context capabilities. It introduces a multi-step data-synthesis pipeline driven by an agent workflow, coupled with progressive context-length training to transfer short-context skills to long-context tasks. Experiments with open-source Llama-3 models show the approach can reach up to one million tokens with competitive performance on various benchmarks, including the RULER suite and needle-in-haystack tasks. The work demonstrates the viability of data-centric strategies to unlock practical long-context reasoning, while also outlining avenues for efficiency and architectural improvements in future research.

Abstract

We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.

Paper Structure

This paper contains 20 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The overall workflow for synthesizing long-context instruction tuning data comprises four steps: instruction generation, relevant document retrieval, recursive query-focused summarization, and response generation. The generated instructions and retrieved documents are concatenated to form the user-turn input, whereas the generated response serves as the target output.
  • Figure 2: Needle-in-haystack test results. The x-axis represents the context lengths, while the y-axis indicates the depth of the inserted needle. The color coding corresponds to the recall score, following previous work (Fu et al., 2024), where green signifies a score close to 1 and red denotes a score close to 0. A single trial was conducted for each unique combination of context length and needle depth. The grey shaded regions denote context lengths beyond the model's capability.
  • Figure 3: Scatter plot illustrating the relationship between the required generation length and the actual output length for samples from the validation set. The dashed line denotes $y=x$, indicating that the output length precisely matches the ground-truth length. For each model, we fit a curve to show the trend of the output length as the required length increases. Details of the curve fitting procedure are provided in the appendix on implementation details.
  • Figure 4: The evolving performance across various test lengths as SelfLong-8B undergoes progressive training on longer contexts. The term "Supported Lengths" denotes $128$k or shorter, which Llama-3.1-8B-Instruct can already handle. "Extended Lengths" refer to the context lengths exceeding $128$k. If a context length is larger than the model's maximum training length, the score is assigned a value of $0$.
  • Figure 5: Needle-in-haystack test results when extending the context length up to 4M. For the $4$M version, tests were conducted only up to a $3$M context length due to the prohibitively high inference cost.
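
The four-step workflow described in the Figure 1 caption (instruction generation, relevant document retrieval, recursive query-focused summarization, response generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `synthesize_example`, `summarize_recursively`, `toy_llm`, and `toy_retriever`, the prompt wording, the `group_size` parameter, and the order in which documents and instruction are concatenated are all assumptions made for illustration. The short-context model and the retriever are passed in as plain callables so that any real backend could be substituted.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
LLM = Callable[[str], str]                   # short-context model: prompt -> text
Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k documents


def summarize_recursively(llm: LLM, query: str, docs: List[str],
                          group_size: int = 2) -> str:
    """Query-focused summarization, merged group by group until one summary is left."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so the merge terminates")
    if not docs:
        return ""
    # Each document fits in the short context, so summarize it on its own first.
    summaries = [llm(f"Summarize for the query '{query}':\n{d}") for d in docs]
    # Then repeatedly merge neighbouring summaries, keeping every prompt short.
    while len(summaries) > 1:
        groups = [summaries[i:i + group_size]
                  for i in range(0, len(summaries), group_size)]
        summaries = [llm(f"Summarize for the query '{query}':\n" + "\n".join(g))
                     for g in groups]
    return summaries[0]


def synthesize_example(llm: LLM, retriever: Retriever, seed_text: str,
                       k: int = 4) -> Dict[str, str]:
    """Build one long-context training pair from short-context components."""
    # Step 1: instruction generation (prompt wording is an assumption).
    instruction = llm(f"Write an instruction about:\n{seed_text}")
    # Step 2: relevant document retrieval.
    docs = retriever(instruction, k)
    # Step 3: recursive query-focused summarization of the retrieved documents.
    notes = summarize_recursively(llm, instruction, docs)
    # Step 4: response generation from the instruction plus the short notes.
    response = llm(f"Instruction: {instruction}\nNotes: {notes}\nAnswer:")
    # The long user-turn input is documents + instruction (order is an assumption);
    # the generated response is the training target.
    return {"input": "\n\n".join(docs + [instruction]), "target": response}


# Toy stand-ins so the sketch runs without any model or index.
def toy_llm(prompt: str) -> str:
    return f"<gen:{len(prompt)}>"


def toy_retriever(query: str, k: int) -> List[str]:
    return [f"doc {i}" for i in range(k)]
```

Calling `synthesize_example(toy_llm, toy_retriever, "some seed text", k=3)` returns a dictionary with an `input` that contains the three toy documents followed by the instruction, and a `target` produced by the final model call. The point of the design is that no single model call ever sees the full concatenated input, which is why a short-context model suffices for data generation.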