Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

Ninglu Shao; Shitao Xiao; Zheng Liu; Peitian Zhang

TL;DR

This paper tackles the limitation of fixed context windows in large language models by introducing Extensible Tokenization, a middleware that converts raw token embeddings into compact extensible embeddings. Trained with a two-stream autoregressive objective, this approach enables flexible, inference-time scaling of context length controlled by a scaling factor $k$ without retraining the downstream LLM. The method demonstrates strong long-context performance gains in language modeling and understanding tasks, maintains compatibility with the downstream LLM and its fine-tuned derivatives, and offers memory- and time-efficient streaming inference, including offline pre-computation for retrieval-augmented workflows. Overall, Extensible Tokenization provides a practical, plug-and-play solution to extend LLM context with significant efficiency advantages and broad applicability.

Abstract

Large language models (LLMs) need sufficient context to handle many critical applications, such as retrieval-augmented generation and few-shot learning. However, due to the constrained window size, an LLM can only access information within a limited context. Although the context window can be extended by fine-tuning, doing so incurs a substantial cost at both the training and inference stages. In this paper, we present Extensible Tokenization as an alternative method that realizes flexible scaling of LLMs' context. Extensible Tokenization serves as middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings. Such embeddings provide a more compact representation of the long context, on top of which the LLM can perceive more information within the same context window. Extensible Tokenization is also characterized by its flexibility: the scaling factor can be chosen freely within a feasible range, allowing the context to be extended to an arbitrary length at inference time. In addition, Extensible Tokenization is introduced as a drop-in component that can be seamlessly plugged into not only the LLM itself but also its fine-tuned derivatives, bringing in the extended contextual information while fully preserving the LLM's existing capabilities. We perform comprehensive experiments on long-context language modeling and understanding tasks, which verify Extensible Tokenization as an effective, efficient, flexible, and compatible method for extending an LLM's context. Our model and source code will be made publicly available.
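
As a rough illustration of how the scaling factor relates to context length, the sketch below computes how many raw tokens a fixed window could represent if every window slot held one extensible embedding summarizing $k$ raw tokens. This is an idealized upper bound under that assumption, since in practice part of the window is occupied by raw token embeddings. The function name `effective_context` and the 4096-token window are illustrative choices, not taken from the paper.

```python
def effective_context(window_size: int, k: int) -> int:
    """Idealized upper bound on raw tokens covered by one context window.

    Assumes (hypothetically) that every slot in the window holds one
    extensible embedding that summarizes k raw tokens.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if k < 1:
        raise ValueError("scaling factor k must be at least 1")
    return window_size * k


if __name__ == "__main__":
    window = 4096  # illustrative window size, chosen for this example
    for k in (1, 2, 4, 8, 16, 32):
        print(f"k={k:>2}: up to {effective_context(window, k)} raw tokens")
```

Because $k$ is only an argument here, changing it requires no change to the window itself, which mirrors the abstract's point that the scaling factor is chosen at inference time.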

Paper Structure (15 sections, 3 equations, 4 figures, 4 tables)

Introduction
Extensible Tokenization
Framework
Extensible Embedding
Two-Stream AR
Inference
Experiments
Experimental Settings
Main Results
Long-Context Language Modeling
Long-Context Understanding
Efficiency Analysis
Ablation Studies
Related Works
Conclusion

Figures (4)

Figure 1: Comparison between Extensible Tokenization and other context extension methods, including 1) Position Interpolation [chen2023extending], 2) NTK-Aware Scaled RoPE [ntkaware2023], and 3) LongLLaMA [tworkowski2023focused]. Extensible Tokenization achieves superior long-context language modeling capability, along with better efficiency in terms of memory and time. PPL is measured on PG19 [raecompressive2019] following the method in [chevalier2023adapting].
Figure 2: Extensible Tokenization. The input data is chunked into equal-sized sub-sequences. Each sub-sequence is transformed and compressed into extensible embeddings. New tokens are predicted based on the extensible embeddings from the preceding chunks and the token embeddings in the same chunk. The extensible tokenizer is learned with a fixed downstream LLM.
Figure 3: Two-Stream AR. In the first pass, the raw token embeddings are transformed into extensible embeddings (with the scaling factor $k=3$). In the second pass (given a window size of 10), the auto-regression is accomplished in two steps, with $x_{1-3}$ and $x_{4-6}$ predicted in the first step, and $x_{7-9}$ and $x_{10-12}$ predicted in the second step.
Figure 4: The extensible tokenizer trained on LLaMA-2-7B can be directly utilized by LongAlpaca-16K and LongChat-32K, leading to further scaling of their context lengths by $\times 16$ and $\times 32$ (with PPL measured on PG19). Remarkably, the context length of LongChat can be extended to 1 million.
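
The chunk-then-compress flow described in the captions of Figures 2 and 3 can be sketched on toy data as follows. This is a minimal illustration, not the authors' implementation: integers stand in for embeddings, the helper names `chunk`, `compress`, and `build_inputs` are hypothetical, and the compression rule (keep the last item of every group of $k$) is an assumed stand-in, because the text above does not specify how a sub-sequence is compressed.

```python
from typing import List, Sequence, Tuple


def chunk(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split the input into equal-sized sub-sequences (the last may be shorter)."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def compress(items: Sequence[int], k: int) -> List[int]:
    """Assumed stand-in for the extensible tokenizer: keep one item per full
    group of k (here, the last item of each group)."""
    return [items[i + k - 1] for i in range(0, len(items) - k + 1, k)]


def build_inputs(tokens: Sequence[int], chunk_size: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """For each chunk, pair the compressed form of all preceding chunks
    with the raw items of the current chunk."""
    inputs = []
    memory: List[int] = []  # compressed representation of preceding chunks
    for current in chunk(tokens, chunk_size):
        inputs.append((list(memory), current))
        memory.extend(compress(current, k))
    return inputs


if __name__ == "__main__":
    # Toy setting loosely following Figure 3: 12 tokens, scaling factor k = 3.
    # The chunk size of 6 is an assumption made for this example.
    for step, (compressed, raw) in enumerate(build_inputs(list(range(1, 13)), 6, 3), start=1):
        print(f"step {step}: compressed context={compressed}, raw tokens={raw}")
```

In this toy setting the second step feeds 2 compressed items plus 6 raw items, 8 in total, which fits within the window size of 10 mentioned in the Figure 3 caption.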
