Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models

Haoran Lian, Junmin Chen, Wei Huang, Yizhe Xiong, Wenping Hu, Guiguang Ding, Hui Chen, Jianwei Niu, Zijia Lin, Fuzheng Zhang, Di Zhang

TL;DR

The paper tackles the challenge of enabling long-context processing in large language models without the complexity of multi-stage pretraining. It introduces HARPE, a single-stage continual pretraining approach that assigns distinct RoPE base frequencies to individual attention heads, effectively simulating multiple training stages within one pass. Across four long-context benchmarks including RULER, HARPE matches or surpasses multi-stage methods, with notable gains such as a 5.46% improvement on Needle-in-a-Haystack and successful extension to contexts up to 128k tokens, while preserving short-context performance. By reducing manual tuning and pipeline complexity, HARPE offers a practical and scalable path to endow LLMs with robust long-context capabilities.

Abstract

Large language models (LLMs) have recently revolutionized Natural Language Processing (NLP). Because their training context size is limited, pretrained LLMs struggle with long token sequences, which limits their performance on various downstream tasks. Current solutions for long context modeling often employ multi-stage continual pretraining, which progressively increases the effective context length through several continual pretraining stages. However, those approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long context modeling capabilities while simplifying the training process. HARPE uses different Rotary Position Encoding (RoPE) base frequency values across different attention heads and trains LLMs directly on the target context length. Extensive experiments on four language modeling benchmarks, including the latest RULER benchmark, demonstrate that HARPE excels in understanding and integrating long-context tasks with single-stage training, matching and even outperforming existing multi-stage methods. Our results highlight that HARPE successfully breaks the stage barrier for training LLMs with long context modeling capabilities.
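
To make the head-wise idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard RoPE formulation, in which dimension pair i of a head of size d rotates at frequency base^(-2i/d), and simply gives each attention head its own base. The function names and the geometric spacing of bases between 1e4 and 1e6 are illustrative assumptions; the abstract does not say how HARPE selects the per-head values.

```python
import numpy as np


def per_head_inv_freq(head_dim, bases):
    """Inverse RoPE frequencies, one row per head: shape (num_heads, head_dim // 2)."""
    # Standard RoPE exponent 2i/d for each dimension pair i.
    exponents = np.arange(0, head_dim, 2, dtype=np.float64) / head_dim
    bases = np.asarray(bases, dtype=np.float64)[:, None]
    return 1.0 / (bases ** exponents[None, :])


def apply_rope_per_head(x, bases):
    """Rotate x of shape (num_heads, seq_len, head_dim), using a separate base per head."""
    num_heads, seq_len, head_dim = x.shape
    inv_freq = per_head_inv_freq(head_dim, bases)             # (H, D/2)
    positions = np.arange(seq_len, dtype=np.float64)          # (S,)
    angles = positions[None, :, None] * inv_freq[:, None, :]  # (H, S, D/2)
    cos, sin = np.cos(angles), np.sin(angles)

    # Treat (even, odd) dimensions as 2-D pairs and rotate each pair.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


if __name__ == "__main__":
    num_heads, seq_len, head_dim = 4, 8, 16
    # Hypothetical choice of bases: geometric spacing is an assumption made
    # for illustration only, not the schedule used in the paper.
    bases = np.geomspace(1e4, 1e6, num=num_heads)
    queries = np.random.default_rng(0).standard_normal((num_heads, seq_len, head_dim))
    rotated = apply_rope_per_head(queries, bases)
    print(rotated.shape)
```

In this sketch, heads with a larger base rotate more slowly along the sequence. That is the same knob a multi-stage pipeline would typically adjust from one stage to the next, applied here across heads within a single pass.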

Paper Structure

This paper contains 17 sections, 10 equations, 2 figures, 8 tables, 1 algorithm.

Figures (2)

  • Figure 1: Illustration of the multi-stage and our proposed single-stage (HARPE) continual pretraining pipeline.
  • Figure 2: Traditional Single-Key Needle-in-a-Haystack: the x-axis represents the number of tokens in the test sample, ranging up to 128k tokens with finer granularity. The y-axis shows the depth of the needle's position within the current test sample.