Extending Context Window of Large Language Models from a Distributional Perspective

Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin

TL;DR

This paper presents a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model’s capability to generalize to longer sequences.

Abstract

Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance when extending the context window length. In this paper, we propose to optimize the context window extension task from the view of rotary angle distribution. Specifically, we first estimate the distribution of the rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model's capability to generalize to longer sequences. Compared with strong baseline methods, our approach reduces the distributional disturbance by up to 72% when extending LLaMA2's context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method maintains the model's performance on the Hugging Face Open LLM benchmark after context window extension, with an average performance fluctuation of only -0.12 to +0.22.
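The first step described in the abstract, estimating a rotary angle distribution and checking how much length extension perturbs it, can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the function names are hypothetical, the frequency formula is the standard RoPE definition with a head dimension of 128 and base 10000, the bin count of 360 is borrowed from the Figure 3 caption, and total variation distance is used as a simple stand-in for whatever disturbance measure the paper defines.

```python
import numpy as np


def rotary_angle_histogram(dim_index, num_positions, scale=1.0,
                           head_dim=128, base=10000.0, num_bins=360):
    """Empirical distribution of rotary angles (mod 2*pi) for one RoPE dimension.

    scale=1.0 corresponds to extrapolation (positions used as-is);
    scale>1.0 corresponds to linear interpolation (positions divided by scale).
    All defaults are assumptions, not values confirmed by the paper's code.
    """
    # Standard RoPE frequency for the dim_index-th pair of dimensions.
    theta = base ** (-2.0 * dim_index / head_dim)
    positions = np.arange(num_positions, dtype=np.float64) / scale
    # Rotary angles wrapped into [0, 2*pi).
    angles = np.mod(positions * theta, 2.0 * np.pi)
    counts, _ = np.histogram(angles, bins=num_bins, range=(0.0, 2.0 * np.pi))
    return counts / counts.sum()


def total_variation(p, q):
    """Total variation distance between two histograms (a stand-in measure)."""
    return 0.5 * float(np.abs(p - q).sum())


# Compare a 4k "pre-trained" distribution with two ways of reaching 8k.
for dim in (6, 22):
    pretrained = rotary_angle_histogram(dim, 4096)
    extrapolated = rotary_angle_histogram(dim, 8192, scale=1.0)
    interpolated = rotary_angle_histogram(dim, 8192, scale=2.0)
    print(f"dim {dim}: "
          f"extrapolation={total_variation(pretrained, extrapolated):.4f}, "
          f"interpolation={total_variation(pretrained, interpolated):.4f}")
```

Under these assumptions, a per-dimension strategy in the spirit of the abstract would pick, for each dimension, whichever option yields the smaller distance to the pre-trained histogram. The paper's actual estimator, disturbance measure, and selection rule may differ.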

Paper Structure (32 sections, 9 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Rotary angle distributions of extrapolation and interpolation methods in two different dimensions, compared with the pre-trained angle distribution. (a) In one dimension, the extrapolated rotary angle distribution fits more closely with the pre-trained distribution. (b) In another dimension, the interpolated distribution fits better with the pre-trained distribution.
  • Figure 2: An example of context window extension, where green and blue points denote pre-trained and OOD position indices. Upper: Extrapolation directly models position indices with RoPE. Lower: Interpolation mitigates the OOD problem of position indices while introducing unseen rotary angles (cross points).
  • Figure 3: The learned rotary angle distributions of LLaMA2. We show the 6th and 22nd dimensions during pre-training within the 4k length, and the corresponding rotary angle distributions when extended to 8k via interpolation and extrapolation, respectively. We set the number of intervals to $b=360$ and display only the first 24 intervals for clarity. The distributions over all intervals are provided in Appendix A.1.
  • Figure 4: Passkey retrieval performance of models with different sizes under various context window lengths.
  • Figure 5: Performance of LLaMA2 on LongBench-E declines as the disturbance increases.
  • ...and 4 more figures