LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

Yike Sun; Haotong Yang; Zhouchen Lin; Muhan Zhang

TL;DR

LiteToken targets intermediate merge residues (IMRs) in BPE tokenizers: tokens that are frequent during merge learning but rarely emitted in the final tokenization. It introduces a corpus-driven pipeline that identifies residues using the final/intermediate (FI) frequency ratio together with neighbor entropy filtering, then applies a split step to unwind IMR tokens and a re-merge step to keep the vocabulary compact and linguistically aligned. The approach is plug-and-play: residues are masked in the output layer and can be pruned from the encoding process without fine-tuning. Across multiple tokenizers and models, IMR prevalence is found to be around 5–10%. Removing these tokens has negligible impact on language modeling and QA performance, while improving robustness to misspellings and boundary artifacts and reducing parameter counts and fragmentation.
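The output-layer masking mentioned above can be illustrated with a minimal sketch. It assumes that masking means setting the logits of residue token ids to negative infinity before the softmax, so those tokens receive zero probability; the summary does not spell out the exact mechanism, and the names `mask_residue_logits`, `softmax`, and the toy ids below are hypothetical.

```python
import math

def mask_residue_logits(logits, residue_ids):
    """Return a copy of `logits` with residue token ids set to -inf.

    Assumption: masking is done by forcing the logit to -inf, so the
    token gets zero probability after the softmax.
    """
    masked = list(logits)
    for token_id in residue_ids:
        masked[token_id] = -math.inf
    return masked

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of four tokens; ids 1 and 3 are treated as residues.
logits = [2.0, 1.0, 0.5, 3.0]
residue_ids = {1, 3}
probs = softmax(mask_residue_logits(logits, residue_ids))
```

In this sketch the model weights are untouched: only the distribution over next tokens changes, which is consistent with the claim that no fine-tuning is needed.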

Abstract

Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning and are therefore retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is applied to the corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameter counts, and improves robustness to noisy or misspelled inputs, while preserving overall performance.

Paper Structure (27 sections, 3 equations, 3 figures, 9 tables)

Introduction
Related Work
Tokenizer learning algorithm
BPE tokenizer and its variants
Tokenizer in LLM
Identifying Intermediate Merge Residues
Motivation
Identifying Algorithm
final/intermediate frequency ratio
Neighbor Entropy Filtering
Removing Intermediate Merge Residues
Split
Re-Merge
Output
For Tiktoken Tokenizers
...and 12 more sections

Figures (3)

Figure 1: An example of a merge tree in the Qwen tokenizer. Green: tokens with a high FI ratio, i.e., high frequency in the final tokenized corpus. Yellow: tokens with a low FI ratio but a high entropy score, which are meaningful partial tokens. Red: tokens with a low FI ratio and a low entropy score (fixed combinations), i.e., intermediate merge residues, which should be removed from the vocabulary. Thresholds: 0.25 for the FI ratio and 4.0 for the entropy score.
Figure 2: Perplexity (PPL) on sampled sentences containing intermediate tokens. Top: C4-en; bottom: RedPajama.
Figure 3: Ablation study of the frequency ratio and entropy thresholds. Top: Qwen; bottom: Llama.
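The three-way split in the Figure 1 caption can be sketched as a small classifier. This is only an illustration under stated assumptions: the FI ratio is taken to be a token's count in the final tokenized corpus divided by its count during merge learning, the entropy score is taken to be the Shannon entropy (in bits) of the token's neighbor distribution, and values at a threshold count as "high". The summary does not confirm any of these details; only the thresholds 0.25 and 4.0 come from the caption, and all function names are hypothetical.

```python
import math

FI_THRESHOLD = 0.25       # from the Figure 1 caption
ENTROPY_THRESHOLD = 4.0   # from the Figure 1 caption

def fi_ratio(final_count, intermediate_count):
    """Assumed definition: final-corpus count / merge-learning count."""
    if intermediate_count <= 0:
        raise ValueError("intermediate_count must be positive")
    return final_count / intermediate_count

def neighbor_entropy(neighbor_counts):
    """Shannon entropy (bits) of a token's neighbor counts.

    Assumption: the log base and the notion of 'neighbor' are not
    given in the summary; base 2 over adjacent tokens is used here.
    """
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in neighbor_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def classify(fi, entropy):
    """Map a token's scores to the green / yellow / red groups."""
    if fi >= FI_THRESHOLD:
        return "keep"      # green: frequently emitted in final output
    if entropy >= ENTROPY_THRESHOLD:
        return "partial"   # yellow: meaningful partial token
    return "residue"       # red: intermediate merge residue

# Toy token that is almost always merged further and has one neighbor.
label = classify(fi_ratio(50, 1000), neighbor_entropy({"ing": 400}))
```

With these toy counts the token has an FI ratio of 0.05 and zero neighbor entropy, so it falls into the red group that the caption marks for removal.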
