Bootstrap Your Own Context Length
Liang Wang, Nan Yang, Xingxing Zhang, Xiaolong Huang, Furu Wei
TL;DR
The paper tackles the problem of training long-context LLMs without scarce natural long-context data by bootstrapping from the models' existing short-context capabilities. It introduces a multi-step data-synthesis pipeline driven by an agent workflow and pairs it with progressive context-length training to transfer short-context skills to long-context tasks. Experiments with open-source Llama-3 models show that the approach extends context length to as many as 1M tokens while remaining competitive on benchmarks including the RULER suite and needle-in-a-haystack tasks. The results suggest that data-centric strategies are a viable route to practical long-context reasoning, and the paper outlines efficiency and architectural improvements as directions for future work.
Abstract
We introduce a bootstrapping approach to train long-context language models by exploiting only their short-context capabilities. Our method uses a simple agent workflow to synthesize diverse long-context instruction tuning data, eliminating the need for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Language models are then fine-tuned on the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length up to 1M tokens, achieving superior performance across various benchmarks.
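
To make the three ingredients named in the abstract concrete (a short-context language model, a text retriever, and a document collection), the following is a minimal sketch of how one long-context training example might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the function names `retrieve` and `synthesize_example`, the prompts, the word-overlap retriever, and the output fields are all hypothetical, and the language model is passed in as a plain callable so that any short-context model could be plugged in.

```python
from typing import Callable, Dict, List


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def synthesize_example(
    seed_doc: str,
    corpus: List[str],
    short_lm: Callable[[str], str],
    k: int = 3,
) -> Dict[str, str]:
    """Build one (context, instruction, response) triple.

    Every model call sees only a short prompt; the long context is
    produced by concatenation, not by a long-context model.
    """
    # Step 1 (hypothetical prompt): ask the short-context model for an instruction.
    instruction = short_lm(f"Write a question about this passage:\n{seed_doc}")

    # Step 2: use the retriever to gather related documents from the collection.
    candidates = [doc for doc in corpus if doc != seed_doc]
    related = retrieve(instruction, candidates, k)

    # Step 3 (hypothetical prompt): answer from the seed passage alone.
    response = short_lm(
        f"Passage:\n{seed_doc}\nQuestion:\n{instruction}\nAnswer briefly."
    )

    # Step 4: concatenate the seed and retrieved documents into a longer context.
    context = "\n\n".join([seed_doc] + related)
    return {"context": context, "instruction": instruction, "response": response}
```

In this sketch the context length is controlled by `k` and by the size of the documents in the collection, so a fine-tuning set with progressively longer contexts could in principle be produced by increasing `k`. Whether the actual workflow works this way is not stated in the abstract.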
