XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun

TL;DR

Long sequences often exceed LLM context windows, hindering practical use. XL$^3$M is a training-free inference framework that segments a long input into short sub-contexts with a shared question, evaluates local distributions, and constructs a concise key context by selecting low-entropy segments for final inference. Empirically, it demonstrates strong long-context reasoning on LongBench-E and Needle in a Haystack, matching or surpassing some fine-tuned baselines while remaining memory- and compute-efficient. This approach enables scalable, streaming long-input reasoning without costly additional training, broadening the applicability of off-the-shelf LLMs.
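
As a rough illustration of the "low-entropy" criterion, assume that the certainty of a local distribution is measured by the Shannon entropy of the model's next-token distribution. The symbols below ($s_i$ for the $i$-th segment, $q$ for the shared question, $V$ for the vocabulary) are illustrative and are not taken from the paper. Under that assumption, the score of the $i$-th sub-context would be:

$$H_i = -\sum_{v \in V} p(v \mid s_i, q)\,\log p(v \mid s_i, q)$$

A smaller $H_i$ means the model is more certain when it reads $s_i$ together with $q$, so segments with small $H_i$ would be the ones kept for the key context.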

Abstract

Length generalization failure, in which a large language model (LLM) fails to generalize to texts longer than its maximum training length, greatly restricts the use of LLMs in scenarios with long streaming inputs. Existing methods that address this problem either incur substantial cost or introduce precision loss. In this paper, we empirically find that the accuracy of an LLM's prediction is highly correlated with its certainty. Based on this finding, we propose an efficient training-free framework named XL3M (short for extra-long large language model), which enables LLMs trained on short sequences to reason over extremely long sequences without any further training or fine-tuning. Under the XL3M framework, the input context is first decomposed into multiple short sub-contexts, each containing an independent segment and a common "question", which is a few tokens taken from the end of the original context. XL3M then measures the relevance between each segment and the "question", and constructs a concise key context by splicing all the relevant segments in chronological order. The key context is then used in place of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason over 20M-long sequences on an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.
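
The segment-wise procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes fixed-size non-overlapping segments, entropy of the next-token distribution as the relevance score, and a simple "keep the `top_k` lowest-entropy segments" rule. All names (`build_key_context`, `next_token_probs`, `toy_next_token_probs`, `seg_len`, `top_k`) are hypothetical, and the toy scorer stands in for a real LLM forward pass.

```python
import math
from typing import Callable, List, Sequence

Tokens = List[str]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def build_key_context(
    tokens: Tokens,
    question_len: int,
    seg_len: int,
    top_k: int,
    next_token_probs: Callable[[Tokens], Sequence[float]],
) -> Tokens:
    """Return a short key context: selected segments plus the question."""
    if not 0 < question_len < len(tokens):
        raise ValueError("question_len must be in (0, len(tokens))")
    # The "question" is the last few tokens of the original context.
    question = tokens[-question_len:]
    body = tokens[:-question_len]
    # Split the remaining text into short, independent segments.
    segments = [body[i:i + seg_len] for i in range(0, len(body), seg_len)]
    # Score each sub-context (segment + question) by prediction entropy.
    scores = [entropy(next_token_probs(seg + question)) for seg in segments]
    # Assumed selection rule: keep the top_k most certain segments.
    ranked = sorted(range(len(segments)), key=lambda i: scores[i])
    keep = sorted(ranked[:top_k])  # restore chronological order
    key_context = [tok for i in keep for tok in segments[i]]
    return key_context + question


def toy_next_token_probs(sub_context: Tokens) -> Sequence[float]:
    """Stand-in for an LLM: confident only when the 'needle' is visible."""
    if "needle" in sub_context:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


if __name__ == "__main__":
    text = ["a", "b", "c", "needle", "d", "e", "f", "g", "where", "?"]
    print(build_key_context(text, 2, 2, 1, toy_next_token_probs))
```

In a real setting, `next_token_probs` would run the model on each sub-context, which is what keeps every forward pass within the model's trained context window; the final answer would then be generated from the returned key context rather than from the full input.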

Paper Structure (19 sections, 1 equation, 6 figures, 2 tables)


Figures (6)

  • Figure 1: The relationship between the accuracy and certainty of LLM's prediction.
  • Figure 2: The main procedure of XL$^3$M.
  • Figure 3: Average score ($\%$) under different context lengths on LongBench-E.
  • Figure 4: Pressure test on "Needle in a Haystack". The test was run at 4 different lengths (16k $\to$ 128k) and 10 different ranges of document depth (bottom $\to$ top). Each result is averaged over 10 independent runs.
  • Figure 5: Pressure test on "Needle in a Haystack" over a larger range of lengths. Left: recall accuracy of XL$^3$M-7B-2k. Right: recall accuracy of XL$^3$M-7B-4k.
  • ...and 1 more figure