Why Does the Effective Context Length of LLMs Fall Short?

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong

TL;DR

Left-skewed distributions of relative positions limit long-range reasoning in open-source LLMs. The authors diagnose this via the relative-position matrix and data-length patterns, and propose STRING, a training-free position-shifting technique, to overwrite ineffective tail positions during inference. STRING yields substantial improvements across multiple open-source models on Needle-in-a-Haystack, RULER, and InfiniteBench, even surpassing some commercial models in certain tasks. The work highlights the critical role of position encoding and data distribution in long-context capabilities and offers a practical, training-free path to enhance long-range processing.

Abstract

Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs' pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
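
The left skew mentioned in the abstract can be illustrated with a small counting exercise. The sketch below is not from the paper's code: the helper name `relative_position_counts` and the example lengths are assumptions chosen for illustration. It relies only on the fact that, under causal attention, a sequence of length L contains the relative distance i exactly L - i times, so small distances are seen far more often than large ones.

```python
from collections import Counter


def relative_position_counts(seq_lengths):
    """Hypothetical helper: count how often each relative distance
    i = m - n (query m, key n, m >= n) occurs under causal attention,
    summed over training sequences of the given lengths."""
    counts = Counter()
    for length in seq_lengths:
        for i in range(length):
            # Within one sequence, distance i occurs (length - i) times.
            counts[i] += length - i
    return counts


# Same token budget (4096 tokens), packed two different ways.
one_long = relative_position_counts([4096])
two_short = relative_position_counts([2048, 2048])

for distance in (0, 2047, 4095):
    print(distance, one_long[distance], two_short[distance])
```

Distance 0 is counted equally often in both settings, while the largest distances are rare in the long sequence and absent from the short ones, which is the imbalance the abstract refers to.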

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Position frequency distribution exhibits a pronounced left-skewed pattern across training data of varying lengths. The natural-distribution panel illustrates the natural data length distribution of SlimPajama-627B, where oversized data is truncated into multiple 2K sequences. The uniform-distribution panel presents the case with a uniform length distribution, where the position frequency declines quadratically. The concatenated-data panel demonstrates that when all data are concatenated into a 2K sequence, the position frequency decreases linearly with increasing position indices. The X-axis represents data length (shown in orange) and position indices (shown in blue). The left Y-axis indicates the frequency of each position, while the right Y-axis represents the number of samples of each length.
  • Figure 2: Analyzing effective context length of LLMs pretrained on SlimPajama with respect to training length, token consumption, and position frequency. In the position-frequency panel, the X-axis is the model's effective length, and the Y-axis indicates the number of times the model was exposed to that specific position during training.
  • Figure 3: Position frequency distribution for models trained with different training lengths after consuming 1T tokens. With the same number of tokens, training length has little effect on small relative positions. For example, the relative position 0 appears 4K times in both a single 4K sequence and two 2K sequences with the same total token count of 4K in each case.
  • Figure 4: NIAH results for our pretrained model TinyLlama-1.3B (2K) and Llama3.1 (128K), where the X-axis shows the input context length and the Y-axis shows the document depth. For both TinyLlama 2K and Llama3.1 128K, most poor-performing cases are concentrated in the lower-left triangle, indicating that the models are unable to gather distant needles.
  • Figure 5: An illustrative example of STRING for a sequence length of $L = 9$. (a) Position indices $6$, $7$, and $8$ are removed from the matrix. (b) Indices $0$, $1$, $2$, $3$, $4$, and $5$ are shifted from the main diagonal to the lower-left triangle with an offset of $3$. (c) A small constant $W$ is added to all diagonals where $m \geq n - 3$, thereby restoring emphasis on the neighboring $W$ tokens. The position matrix of Llama3.1-128K using STRING is shown in the Appendix.
  • ...and 4 more figures
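
The three steps in the Figure 5 caption can be approximated with a small NumPy sketch. This is a toy reconstruction under stated assumptions, not the authors' implementation: the helper name `string_position_matrix`, the choice $W = 1$, the use of -1 to mark masked future positions, and the reading of the shifted region as the entries lying at least $S = 3$ diagonals below the main diagonal (that is, $m - n \geq S$ for query row $m$ and key column $n$) are all illustrative choices.

```python
import numpy as np


def string_position_matrix(L, S, W):
    """Hypothetical helper: relative-position matrix after a STRING-style shift.

    Rows are query indices m, columns are key indices n.
    Entries above the main diagonal (future tokens) are marked with -1.
    """
    m = np.arange(L)[:, None]
    n = np.arange(L)[None, :]
    P = m - n  # standard relative positions on the causal part

    # Assumed shifted region: entries at least S diagonals below the main one.
    # Subtracting S reuses the small, frequently trained indices there, so the
    # largest S indices are never used; adding W keeps the shifted region from
    # colliding with the W nearest-neighbour positions.
    shifted = (m - n) >= S
    P = np.where(shifted, P - S + W, P)

    return np.where(m >= n, P, -1)


pure_shift = string_position_matrix(L=9, S=3, W=0)   # steps (a) and (b) only
with_window = string_position_matrix(L=9, S=3, W=1)  # step (c) with W = 1

print(pure_shift)
print(with_window)
```

With `W=0` the largest index in the toy matrix is 5, so indices 6, 7, and 8 no longer appear; with `W=1` the shifted region starts at 1 instead of 0, leaving index 0 to the token itself.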