LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang
TL;DR
LiteToken targets intermediate merge residues (IMRs) in BPE tokenizers: tokens that are frequent while merges are being learned but rarely emitted in the final tokenization. It introduces a corpus-driven pipeline that identifies residues using the final/intermediate frequency ratio together with neighbor-entropy filtering, then applies a split step to unwind IMR tokens and a re-merge step to keep the vocabulary compact and linguistically aligned. The approach is plug-and-play: residues are masked in the output layer and can be pruned from the encoding process without fine-tuning. Across multiple tokenizers and models, IMR prevalence is found to be around 5–10%; removing these tokens has negligible impact on language modeling and QA performance while improving robustness to misspellings and boundary artifacts and reducing parameter counts and fragmentation.
Abstract
Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent enough during merge learning to be retained in the final vocabulary, but that are mostly merged further and therefore rarely emitted when the trained tokenizer is applied to the corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation and parameter count and improves robustness to noisy or misspelled inputs, while preserving overall performance.
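
To make the notion of a residue concrete, the following is a minimal sketch on a toy character-level BPE, not the paper's pipeline. It assumes that a token's "intermediate" frequency is the pair count at the moment its merge is learned and that its "final" frequency is how often it survives in the fully merged corpus. The function names, the tie-breaking rule, and the 0.2 threshold are hypothetical choices for illustration, and the neighbor-entropy filter is omitted.

```python
from collections import Counter


def apply_merge(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated token."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(word_freqs, num_merges):
    """Learn merges greedily; record each new token's pair count at merge time."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merge_time_freq = {}
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Assumed tie-break: highest count first, then lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merge_time_freq.setdefault(best[0] + best[1], pair_counts[best])
        words = apply_merge(words, best)
    return merge_time_freq, words


def final_to_intermediate_ratio(merge_time_freq, final_words):
    """Ratio of final emission count to merge-time count for each learned token."""
    final_freq = Counter()
    for symbols, freq in final_words.items():
        for symbol in symbols:
            final_freq[symbol] += freq
    return {tok: final_freq[tok] / f for tok, f in merge_time_freq.items()}


def find_residue_candidates(ratios, threshold=0.2):
    """Flag tokens whose ratio falls below an (assumed) threshold."""
    return sorted(tok for tok, r in ratios.items() if r < threshold)


if __name__ == "__main__":
    merge_freq, final_words = train_toy_bpe({"hello": 10, "help": 1}, num_merges=4)
    ratios = final_to_intermediate_ratio(merge_freq, final_words)
    print(ratios)
    print(find_residue_candidates(ratios))
```

On this toy corpus, "el" and "hell" are learned as stepping stones toward "hello" and never appear in the final tokenization, which is the pattern the abstract describes. A real analysis would compute both counts over a large corpus with the actual tokenizer's merge table.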
