SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan

TL;DR

This work challenges the assumption that longer pretraining context windows always improve performance. It shows that, under a fixed token budget, models pretrained with shorter context windows achieve stronger results on standard benchmarks, and it introduces SkyLadder, a simple short-to-long context window scheduling strategy that gradually expands the context window during pretraining. In experiments across model scales (up to 3B parameters) and context windows (up to 32K), SkyLadder delivers consistent gains on standard benchmarks, matches or exceeds baselines on long-context tasks, and trains up to 22% faster. The approach is validated across multiple packing and masking schemes and is shown to generalize to code data and other architectures, offering a practical recipe for more efficient pretraining of long-context models.

Abstract

Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance while matching or exceeding baseline results on long-context tasks. Through extensive experiments, we pretrain 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
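
The short-to-long transition described in the abstract can be pictured as a function from training step to context window size. The sketch below is only an illustration: the linear ramp, the function name `context_window_at`, and the parameters `w_start`, `w_final`, and `ramp_fraction` (with their default values) are assumptions made here, not details taken from the abstract or from the SkyLadder repository.

```python
def context_window_at(step: int, total_steps: int, w_start: int = 32,
                      w_final: int = 8192, ramp_fraction: float = 0.5) -> int:
    """Hypothetical linear short-to-long context window schedule.

    The window grows linearly from w_start to w_final over the first
    ramp_fraction of training, then stays at w_final. All defaults are
    illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < 0:
        raise ValueError("step must be non-negative")
    # Number of steps over which the window expands (at least one).
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    if step >= ramp_steps:
        return w_final
    # Integer interpolation keeps the window an exact token count.
    return w_start + (w_final - w_start) * step // ramp_steps


if __name__ == "__main__":
    for s in (0, 250, 500, 1000):
        print(s, context_window_at(s, total_steps=1000))
```

A training loop would call such a function once per step and pass the resulting window to whatever builds the attention mask; other ramp shapes (for example stepwise or exponential) would fit the same interface.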

Paper Structure

This paper contains 62 sections, 1 equation, 14 figures, 28 tables.

Figures (14)

  • Figure 1: Left: The pretraining context window of LLMs has grown in recent years. Right: Average performance (in %) across nine downstream tasks for 1B-parameter models with different pretraining context window sizes (color-coded). Increasing the context window degrades overall performance.
  • Figure 2: Schematic comparison of training-time context window scheduling.
  • Figure 3: An illustration of the pretraining data preparation workflow, highlighting several critical decisions: the method of data packing, the type of attention mask to employ (causal or intra-doc mask), and the appropriate context window length $L$.
  • Figure 4: Ablation studies of different factors on different context window sizes. The validation PPL is obtained on the validation documents with a sliding window size of 512 tokens. The packing strategy in (a) is Random, and the model sizes in (b) and (c) are 1B and 120M, respectively. The context window in (d) means the number of available preceding tokens when making a next-token prediction (calculation details in Section \ref{app:sec:definition-of-context}).
  • Figure 5: An illustration of SkyLadder with Random and IntraDoc. The example shows a packed sequence (length $L$) consisting of two documents. For SkyLadder, the context window $w$ starts from a small value and dynamically adjusts during training, eventually converging to the masking patterns of Random or IntraDoc.
  • ...and 9 more figures
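
The masking idea in the Figure 5 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the window $w$ is enforced by splitting the packed sequence into consecutive blocks of $w$ positions with causal attention inside each block (a sliding window would be another option), and the function name `build_mask` and its arguments are hypothetical. When $w$ equals the packed length $L$, the mask reduces to plain causal masking (Random) or to causal masking within each document (IntraDoc).

```python
from typing import List


def build_mask(doc_lengths: List[int], w: int,
               intra_doc: bool = False) -> List[List[bool]]:
    """Return an L x L mask; mask[i][j] is True if token i may attend to j.

    Assumption: the context window w is applied as consecutive blocks of
    w positions over the packed sequence, causal within each block.
    """
    if w <= 0:
        raise ValueError("w must be positive")
    # Document index of every position in the packed sequence.
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    length = len(doc_ids)
    mask = []
    for i in range(length):
        row = []
        for j in range(length):
            causal = j <= i                      # no attention to the future
            same_block = (i // w) == (j // w)    # stay inside the window block
            same_doc = doc_ids[i] == doc_ids[j]  # only enforced for IntraDoc
            row.append(causal and same_block and (same_doc or not intra_doc))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two packed documents of lengths 3 and 2, window w = 2.
    for row in build_mask([3, 2], w=2, intra_doc=True):
        print("".join("x" if allowed else "." for allowed in row))
```

Combined with a schedule that grows $w$ over training, the same packed batch yields progressively less restrictive masks until the final Random or IntraDoc pattern is reached.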