
Large Language Model as Token Compressor and Decompressor

Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang

Abstract

In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM, via lightweight LoRA-based adapter heads, to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. Empirically, our method achieves up to an 18× token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
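
As one concrete reading of the abstract, a minimal training-step sketch is given below. It assumes a Hugging Face-style stack (transformers + peft); the base model, codebook size, latent-length cap, LoRA targets, and the greedy discretization are all illustrative assumptions rather than the paper's actual recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# All of the following constants are illustrative assumptions, not the paper's values.
BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumed base model
Z_VOCAB_SIZE = 512                        # assumed size of the discrete Z-token codebook
MAX_Z_TOKENS = 32                         # assumed cap on the latent sequence length

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Reserve new embedding rows for the Z-token codebook.
tokenizer.add_tokens([f"<z_{i}>" for i in range(Z_VOCAB_SIZE)])
model.resize_token_embeddings(len(tokenizer))

# Lightweight LoRA adapters; the abstract mentions LoRA-based adapter heads,
# but the rank and target modules here are guesses.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

def autoencoding_step(text: str) -> torch.Tensor:
    """One self-expressive autoencoding step: text -> Z-tokens -> text.

    Discretization is handled here by decoding Z-tokens greedily and
    training only the reconstruction side, which is one plausible
    simplification; the paper's actual optimization may differ.
    """
    src = tokenizer(text, return_tensors="pt")
    # (1) Compressor pass: continue the input with Z-tokens only, by
    # suppressing every ordinary vocabulary id during generation.
    ordinary_ids = list(range(len(tokenizer) - Z_VOCAB_SIZE))
    with torch.no_grad():
        out = model.generate(**src, max_new_tokens=MAX_Z_TOKENS,
                             suppress_tokens=ordinary_ids)
    z_codes = out[:, src["input_ids"].shape[1]:]   # the latent Z-token run
    # (2) Decompressor pass: condition on the Z-tokens and reconstruct the
    # original token sequence with a standard cross-entropy loss.
    dec_input = torch.cat([z_codes, src["input_ids"]], dim=1)
    labels = dec_input.clone()
    labels[:, : z_codes.shape[1]] = -100           # no loss on the latent prefix
    return model(input_ids=dec_input, labels=labels).loss
```

The content-adaptive, variable-length behavior described in the abstract would presumably come from letting the compressor emit an end-of-latent marker rather than always filling a fixed budget; the fixed MAX_Z_TOKENS cap above is only for brevity.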

Paper Structure

This paper contains 18 sections, 11 equations, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Our framework consists of three components: a compressor, an inference module, and a decompressor. It supports two usage paradigms. In the first, we directly feed the compressed representation to the decompressor to perform the downstream task. In the second, we first run the inference module in the Z-token space and then invoke the decompressor to recover task outputs at the token level.
  • Figure 2: Here we demonstrate the flexible use of Z-tokens. The left side shows that the same sentence can be compressed into different Z-tokens yet decompressed to convey the same meaning. The right side shows, from three perspectives, that the same Z-tokens can be decoded into different content in different contexts. This illustrates that Z-tokens are controllable and interpretable, rather than chaotic.
  • Figure 3: BLEU-4 at different input lengths and compression ratios. The solid line represents our method, and the dashed line represents ICAE.
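
The two usage paradigms in Figure 1 can be read as the following control flow. This is a minimal sketch with hypothetical wrapper functions (compress, reason_in_z_space, decompress) standing in for the three components; the paper does not specify this API, and the stub bodies only mimic the shapes of the real components.

```python
# Hypothetical stand-ins for the three components in Figure 1 (compressor,
# inference module, decompressor). Every name and signature below is an
# illustrative assumption.
def compress(text: str) -> list[int]:
    """Compressor: map text to a short run of discrete Z-token ids (stubbed)."""
    return [0] * max(1, len(text.split()) // 18)  # ~18x reduction, echoing the abstract

def reason_in_z_space(z_codes: list[int], task: str) -> list[int]:
    """Inference module: autoregressive generation directly over Z-tokens (stubbed)."""
    return z_codes  # a real system would generate new Z-tokens conditioned on the task

def decompress(z_codes: list[int], task: str | None = None) -> str:
    """Decompressor: recover token-level text (or a task output) from Z-tokens (stubbed)."""
    return f"<decoded text from {len(z_codes)} Z-tokens>"

long_prompt = "..."  # a long document or query to be compressed

# Paradigm 1 (e.g. prompt compression): feed the compressed representation
# directly to the decompressor, which performs the downstream task itself.
answer = decompress(compress(long_prompt), task="answer the question")

# Paradigm 2 (reasoning in Z-token space): run the inference module over the
# Z-tokens first, then invoke the decompressor to recover token-level output.
z_out = reason_in_z_space(compress(long_prompt), task="answer the question")
answer = decompress(z_out)
```

Paradigm 2 is what enables the "autoregressive generation directly in the Z-token space" mentioned in the abstract: intermediate computation stays in the compressed latent language, and only the final result is decoded back to ordinary tokens.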