CLEX: Continuous Length Extrapolation for Large Language Models

Guanzheng Chen; Xin Li; Zaiqiao Meng; Shangsong Liang; Lidong Bing

TL;DR

This work tackles the fundamental limit of fixed context windows in Transformer-based LLMs by introducing Continuous Length Extrapolation (CLEX), a technique that treats position-embedding scaling as a continuous dynamical process learned with a neural ODE. By parameterizing continuous dynamics over the RoPE frequency basis, CLEX enables fine-grained extrapolation to much longer context lengths while preserving performance on shorter, trained lengths, and does so with minimal latency as a drop-in RoPE component. Empirical results show CLEX can extend context to 4x–8x training length, with strong perplexity and LongBench performance, and scaling benefits with larger base models; ablations confirm the necessity of continuous dynamics and the effectiveness of the proposed training/testing strategy. Overall, CLEX advances practical long-context reasoning for open-source LLMs and points to a scalable path for future long-context NLP systems.

Abstract

Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted to the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either show notable limitations in their extrapolation abilities or sacrifice partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x the training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
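
The idea of treating PE scaling as dynamics over the length scaling factor can be illustrated with a small numerical sketch. It is not the authors' implementation: CLEX learns the dynamics with a neural network, whereas the sketch below assumes a fixed, hand-written derivative that corresponds to plain Position Interpolation (each RoPE frequency divided by the scaling factor t). The function names, the Euler integrator, and the step count are all illustrative assumptions.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequency basis: theta_i = base^(-2i/dim)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def pi_log_freq_derivative(t, log_theta):
    """Assumed dynamics: d(log theta)/dt = -1/t.

    Integrating this from t=1 recovers theta(t) = theta(1) / t, i.e.
    Position Interpolation. A learned network would replace this function.
    """
    return -np.ones_like(log_theta) / t

def integrate_log_frequencies(log_theta, t_end, derivative, steps=1000):
    """Explicit Euler integration of the log-frequencies from t=1 to t_end."""
    t = 1.0
    h = (t_end - 1.0) / steps
    z = log_theta.copy()
    for _ in range(steps):
        z = z + h * derivative(t, z)
        t += h
    return z

theta = rope_frequencies(dim=8)
# Frequencies for a 4x scaling factor, obtained by integrating the dynamics.
theta_4x = np.exp(integrate_log_frequencies(np.log(theta), 4.0, pi_log_freq_derivative))
```

Because the scaling factor is a continuous integration variable, the same routine yields a frequency basis for any intermediate factor (for example 2.5) by stopping the integration at that value, rather than only for one preset target length.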

Paper Structure (33 sections, 14 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Preliminaries
Rotary Position Embedding (RoPE)
PE Scaling Methods
Methodology
Position Embedding Scaling: A Unified View
Theorem 1.
Continuous PE Scaling via Neural ODE
Continuous Length Extrapolation: Train on Short, Test on Long
Experiments
Long-Context Language Modelling
CLEX achieves length extrapolation.
The scaling law for the extrapolation ability of CLEX.
Ablation Study
Continuous dynamics.
...and 18 more sections

Figures (10)

Figure 1: The PPLs of our CLEX and various baselines tested on 64k context length.
Figure 2: The graphical model of discrete PE scaling (left) and our continuous PE scaling (right).
Figure 3: Left: The PPLs of CLEX on different evaluation sequence lengths with 7B and 13B parameter sizes. Right: The PPLs of CLEX across varying training data sizes with different parameter sizes and evaluation lengths.
Figure 4: The ablation studies for continuous dynamics, sampling strategies and $\lambda$ factor in \ref{eq:network}.
Figure 5: Left: the average scores for each domain of tasks in LongBench. Right: the average scores of all tasks corresponding to the training length of each model. Note that CLEX is trained on 4k sequence length but directly tested on 16k context length without truncation.
...and 5 more figures