From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro

TL;DR

This work tackles the context-length bottleneck in LLMs by proposing an efficient, two-stage training recipe that scales context windows from 128K to as many as 4M tokens. It combines one-step continued pretraining with YaRN RoPE scaling and special document separators, followed by instruction tuning on a diverse short-context SFT dataset, yielding UltraLong-8B built on Llama-3.1-Instruct. The model achieves state-of-the-art performance on long-context benchmarks such as RULER, LV-Eval, and InfiniteBench while maintaining competitive results on standard tasks, and the authors provide extensive ablations that identify separators, YaRN scaling, and one-step extension as critical. Releasing the weights and detailed training guidance supports reproducibility and practical deployment for long-context reasoning and document-scale understanding.
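To make the "special document separators" idea concrete, here is a minimal sketch of how shorter tokenized documents might be concatenated into fixed-length long training sequences with a separator token between them. This is an illustration under stated assumptions, not the authors' pipeline: the function name `pack_with_separators` and the separator id `SEP_ID` are hypothetical, and the actual separator token and packing procedure are not specified in this summary.

```python
from typing import Iterable, List

# Hypothetical separator token id, chosen only for illustration.
SEP_ID = -1


def pack_with_separators(
    docs: Iterable[List[int]], seq_len: int, sep_id: int = SEP_ID
) -> List[List[int]]:
    """Concatenate tokenized documents with sep_id between consecutive
    documents, then cut the stream into chunks of exactly seq_len tokens.
    A trailing partial chunk is dropped."""
    stream: List[int] = []
    for doc in docs:
        if stream:
            # Mark the boundary between the previous document and this one.
            stream.append(sep_id)
        stream.extend(doc)
    # Only emit full-length chunks.
    return [
        stream[i : i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_with_separators(toy_docs, seq_len=4))
    # prints [[1, 2, 3, -1], [4, 5, -1, 6]]
```

In a real setup the separator would be a token from the model's vocabulary and `seq_len` would be the target context length (for example 1M tokens); both are left as parameters here because the summary does not give them.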

Abstract

Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama-3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.

Paper Structure

This paper contains 27 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our training pipeline. In Stage 1, the model's context window is extended through continued pretraining, leveraging techniques such as special document separators and YaRN-based scaling to handle ultra-long sequences. In Stage 2, instruction tuning is applied using a curated dataset to enhance the model's instruction-following and reasoning capabilities. This pipeline enables the development of language models that achieve good performance on both long-context and standard benchmarks.
  • Figure 2: $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$
  • Figure 3: $\mathtt{Llama\text{-}3\text{-}8B\text{-}ProLong\text{-}512k\text{-}Instruct}$
  • Figure 4: $\mathtt{Llama\text{-}3\text{-}8B\text{-}Instruct\text{-}Gradient\text{-}1048k}$
  • Figure 5: $\mathtt{UltraLong\text{-}8B\text{-}1M\text{-}Instruct}$
  • ...and 2 more figures