zip2zip: Inference-Time Adaptive Tokenization via Online Compression

Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West

TL;DR

zip2zip introduces inference-time adaptive tokenization by pairing an online LZW-based tokenizer with a dynamic hypertoken vocabulary and a hyper-embedding module, enabling the model to compress input and output sequences without retraining tokenizer vocabularies. The system is trained on compressed sequences with a causal language modeling objective plus an auxiliary reconstruction loss, and uses a cacheable hyper-embedding mechanism to minimize runtime overhead. Empirical results show substantial token-length reductions (15–40%) and latency improvements (up to ~40%), with generally competitive performance on NLP benchmarks and modest degradations in some multilingual and numerically intensive tasks. This approach demonstrates a practical path to domain-adaptive, computation-efficient LLM inference without extensive tokenizer retraining.

Abstract

Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized on general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a novel method for achieving context-adaptive tokenization in LLMs at inference time. Leveraging an online data compression algorithm (Lempel-Ziv-Welch), zip2zip dynamically expands its active vocabulary at inference time by continuously replacing fragmented token sequences with more compact hypertokens, which it can immediately output during generation. In doing so, the model refines its internal tokenization scheme to match the token distribution of the current context, reducing redundancy and improving representational efficiency. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch compression that incrementally merges co-occurring tokens into reusable hypertokens on the fly; (2) a dynamic embedding (and unembedding) layer that computes embeddings for newly formed hypertokens at runtime; and (3) a variant of autoregressive language modeling that pretrains the model to handle hypertokenized, compressed text sequences as inputs and outputs. We show that an existing LLM can be uptrained for zip2zip in 10 GPU-hours via parameter-efficient finetuning. The resulting LLM performs test-time adaptation, learning to use hypertokens in unseen contexts and reducing input and output tokens by 15-40%.
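
The first component, an LZW-based tokenizer that merges co-occurring tokens into hypertokens on the fly, can be illustrated with a minimal sketch. This is textbook LZW applied to token IDs, not necessarily the exact variant used by zip2zip; the function names, the `max_merge` cap on hypertoken length, and the convention that hypertoken IDs start right after the base vocabulary are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[int, ...]  # a run of base-token IDs


def lzw_compress(tokens: List[int], base_vocab_size: int, max_merge: int = 3) -> List[int]:
    """Greedy LZW over token IDs; phrases longer than max_merge are never added."""
    codebook: Dict[Phrase, int] = {}  # phrase of base tokens -> hypertoken ID
    next_id = base_vocab_size         # assumption: hypertoken IDs follow the base vocabulary
    output: List[int] = []
    current: Phrase = ()

    def emit(phrase: Phrase) -> None:
        # Single tokens are emitted as themselves, longer phrases as hypertoken IDs.
        output.append(phrase[0] if len(phrase) == 1 else codebook[phrase])

    for token in tokens:
        candidate = current + (token,)
        if len(candidate) == 1 or candidate in codebook:
            current = candidate       # keep extending the longest known phrase
            continue
        emit(current)
        if len(candidate) <= max_merge:
            codebook[candidate] = next_id  # new hypertoken, reusable from now on
            next_id += 1
        current = (token,)
    if current:
        emit(current)
    return output


def lzw_decompress(codes: List[int], base_vocab_size: int, max_merge: int = 3) -> List[int]:
    """Rebuild the same codebook from the code stream and recover the base tokens."""
    codebook: Dict[int, Phrase] = {}  # hypertoken ID -> phrase of base tokens
    next_id = base_vocab_size
    output: List[int] = []
    previous: Phrase = ()
    for code in codes:
        if code < base_vocab_size:
            phrase: Phrase = (code,)
        elif code in codebook:
            phrase = codebook[code]
        else:
            # The code names the entry the encoder created one step earlier.
            phrase = previous + (previous[0],)
        if previous and len(previous) + 1 <= max_merge:
            codebook[next_id] = previous + (phrase[0],)
            next_id += 1
        output.extend(phrase)
        previous = phrase
    return output


tokens = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]           # toy sequence over a base vocabulary of size 6
codes = lzw_compress(tokens, base_vocab_size=6)    # IDs >= 6 are hypertokens
restored = lzw_decompress(codes, base_vocab_size=6)
```

Because the decoder reconstructs the codebook from the code stream alone, no codebook has to be transmitted or stored; with `max_merge=1` nothing is ever merged and the sequence passes through unchanged.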

Paper Structure

This paper contains 43 sections, 3 theorems, 18 equations, 10 figures, 14 tables.

Key Result

Theorem 2.1

If $g$ is lossless (i.e., bijective onto its image), then the total entropy and cross-entropy are invariant under the transformation:

Figures (10)

  • Figure 1: zip2zip inference process. At each decoding step, the model has a growing context composed of both base tokens (blue) and hypertokens (green). The static vocabulary of size 6 remains fixed, while the dynamic vocabulary is continuously expanded by merging co-occurring tokens using LZW compression. The codebook (right) maps hypertoken IDs to their corresponding base tokens. As decoding progresses, new hypertokens created at step $t$ (e.g., "to be", "or not") become immediately available for reuse at step $t+1$. Hypertokens are also eligible for merging, enabling the formation of nested hypertokens. The final output sequence (bottom) is reconstructed via LZW decompression.
  • Figure 2: (a) Dynamic embedding: Base tokens are embedded via a static LM embedding matrix, while hypertokens (e.g., "to be" or "to be or") are dynamically composed using a hyper-encoder over their constituent base tokens. (b) Language modeling in compressed space: The model is trained to predict compressed token sequences produced by LZW, optimizing cross-entropy loss over compressed token IDs. (c) Auto-encoding loss: To ensure hypertokens are semantically consistent with their base-token compositions, the model also learns to reconstruct the original base tokens from the hypertoken via a decoding loss.
  • Figure 3: zip2zip architecture and pipeline. At inference time, base tokens are compressed into hypertokens using LZW. A hyper-encoder computes embeddings for hypertokens, which are processed by the base LLM. Output representations are projected jointly on base and hyper-unembedding layers, producing joint logits and sampled tokens, which can be decoded back to base tokens.
  • Figure 4: Phi-3.5-zip2zip output examples. Blue: base tokens. Yellow: hypertokens (composed of 2 base tokens). Orange: hypertokens (composed of 3+ base tokens).
  • Figure 5: Effect of maximum merge size $M$ on zip2zip training loss: $M = 1$ (no compression) achieves the lowest loss overall. Among compressed settings, $M= 3$ performs best, while $M=2$ shows the worst convergence. Larger $M$ (4 and 5) yield slightly worse results than $M=3$.
  • ...and 5 more figures
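
The dynamic embedding and joint projection described in the captions of Figures 2(a) and 3 can be sketched in a few lines. This is a toy illustration only: the hyper-encoder's actual architecture is not given here, so mean pooling is swapped in as a stand-in, and the class name `HyperEmbeddingCache`, the function `joint_logits`, and the reuse of embedding rows as unembedding rows are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Stand-in for a learned hyper-encoder: average the constituent embeddings."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class HyperEmbeddingCache:
    """Computes each hypertoken embedding once and reuses it afterwards."""

    def __init__(self, base_embeddings: Sequence[Vector]) -> None:
        self.base = base_embeddings
        self.cache: Dict[Tuple[int, ...], Vector] = {}
        self.computed = 0  # number of hyper-encoder calls actually made

    def embed(self, base_tokens: Tuple[int, ...]) -> Vector:
        if len(base_tokens) == 1:
            return self.base[base_tokens[0]]  # plain base token: static lookup
        if base_tokens not in self.cache:
            self.cache[base_tokens] = mean_pool([self.base[t] for t in base_tokens])
            self.computed += 1
        return self.cache[base_tokens]


def joint_logits(hidden: Vector,
                 base_embeddings: Sequence[Vector],
                 hyper_embeddings: Sequence[Vector]) -> List[float]:
    """Score the hidden state against base rows, then hypertoken rows (assumed tied weights)."""
    rows = list(base_embeddings) + list(hyper_embeddings)
    return [sum(h * r for h, r in zip(hidden, row)) for row in rows]


base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy base vocabulary of 3 tokens, dimension 2
cache = HyperEmbeddingCache(base)
to_be = cache.embed((0, 1))                  # hypertoken made of base tokens 0 and 1
to_be_again = cache.embed((0, 1))            # served from the cache, no recomputation
logits = joint_logits([2.0, 4.0], base, [to_be])
```

The last entry of `logits` belongs to the hypertoken, so sampling over the joint vector can pick either a base token or a hypertoken in one step.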

Theorems & Definitions (6)

  • Theorem 2.1: Entropy invariance under lossless compression
  • Proposition B.1
  • proof
  • Definition B.1: Compression Rate
  • Theorem G.1: Entropy Invariance under Lossless Compression
  • proof
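
The entropy-invariance claim of Theorem 2.1 can be checked numerically on a toy distribution. The sketch below is not the paper's proof; the distributions `p` and `q`, the merge function `merge_to_be`, and the helper names are made up for illustration. It relies only on the standard fact that an injective map relabels outcomes without merging their probabilities.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, Hashable, Tuple

Dist = Dict[Hashable, float]


def entropy(p: Dist) -> float:
    """Shannon entropy in bits."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)


def cross_entropy(p: Dist, q: Dist) -> float:
    """Cross-entropy of q relative to p, in bits."""
    return -sum(prob * math.log2(q[x]) for x, prob in p.items() if prob > 0)


def push_forward(p: Dist, g: Callable) -> Dist:
    """Distribution of g(X) when X is distributed according to p."""
    out: Dist = defaultdict(float)
    for x, prob in p.items():
        out[g(x)] += prob
    return dict(out)


def merge_to_be(seq: Tuple[str, ...]) -> Tuple[str, ...]:
    """Toy lossless compressor: replace the pair ("to", "be") with one hypertoken."""
    out, i = [], 0
    while i < len(seq):
        if seq[i:i + 2] == ("to", "be"):
            out.append("to_be")  # assumed not to collide with any base token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


# Toy data distribution p and model distribution q over two-token sequences.
p = {("to", "be"): 0.5, ("to", "to"): 0.25, ("be", "to"): 0.25}
q = {("to", "be"): 0.25, ("to", "to"): 0.25, ("be", "to"): 0.5}

h_before, h_after = entropy(p), entropy(push_forward(p, merge_to_be))
ce_before = cross_entropy(p, q)
ce_after = cross_entropy(push_forward(p, merge_to_be), push_forward(q, merge_to_be))
h_lossy = entropy(push_forward(p, len))  # len is not injective: all sequences collapse
```

The sequence-level quantities match before and after compression, while the lossy map `len` destroys information. Per-token averages still change under compression, since the same total is spread over fewer tokens.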