LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang

TL;DR

LiteToken tackles intermediate merge residues (IMRs) in BPE tokenizers: tokens that are frequent during tokenizer training but rarely emitted in the final tokenization. It introduces a corpus-driven pipeline that identifies residues using the final/intermediate (FI) frequency ratio and neighbor-entropy filtering, then applies a split step to unwind IMR tokens and a re-merge step to keep the vocabulary compact and linguistically aligned. The approach is plug-and-play: residues are masked in the output layer and can be pruned from the encoding process without fine-tuning. Across multiple tokenizers and models, IMRs are found to make up around 5–10% of the vocabulary. Removing them has negligible impact on language modeling and QA performance, while improving robustness to misspellings and boundary artifacts and reducing parameter counts and fragmentation.
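The output-layer masking mentioned above can be illustrated with a minimal sketch. It assumes masking means setting the logits of residue tokens to negative infinity before the softmax; the function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def mask_residue_logits(logits, residue_ids):
    """Return a copy of `logits` with residue positions set to -inf.

    Assumption: "masking in the output layer" means these tokens can
    never be sampled; the paper's exact mechanism may differ.
    """
    masked = list(logits)
    for tok_id in residue_ids:
        masked[tok_id] = -math.inf
    return masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of four tokens; token 3 is treated as a residue.
logits = [2.0, 1.0, 0.5, 3.0]
residue_ids = {3}
probs = softmax(mask_residue_logits(logits, residue_ids))
```

After masking, the residue token receives zero probability and the remaining mass is renormalized over the kept tokens, so no model weights need to change.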

Abstract

Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent enough during merge learning to be retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is applied to a corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.

Paper Structure (27 sections, 3 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: An example of a merging tree in the Qwen tokenizer. Green: tokens with a high FI ratio, i.e., high frequency in the final tokenized corpus. Yellow: tokens with a low FI ratio but a high entropy score, which are meaningful partial tokens. Red: tokens with a low FI ratio and a low entropy score (fixed combinations), i.e., intermediate merge residues, which should be removed from the vocabulary. Thresholds: 0.25 for the FI ratio and 4.0 for the entropy score.
  • Figure 2: Perplexity (PPL) on sampled sentences containing intermediate tokens. Top: C4-en; bottom: RedPajama.
  • Figure 3: Ablation study of the frequency-ratio and entropy thresholds. Top: Qwen; bottom: Llama.
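
The three-way labeling in the Figure 1 caption can be sketched as a small decision rule. Only the two thresholds (0.25 and 4.0) come from the caption; everything else here is an assumption for illustration. The FI ratio is taken to be final count divided by intermediate count, the entropy score is taken to be Shannon entropy in bits over neighboring-token counts, and the comparisons are inclusive. The paper's exact definitions may differ.

```python
import math
from collections import Counter

FI_THRESHOLD = 0.25       # from the Figure 1 caption
ENTROPY_THRESHOLD = 4.0   # from the Figure 1 caption


def fi_ratio(final_count, intermediate_count):
    """Assumed definition: final-corpus count / merge-learning count."""
    if intermediate_count == 0:
        return 0.0
    return final_count / intermediate_count


def neighbor_entropy(neighbor_counts):
    """Assumed definition: Shannon entropy (bits) of neighbor counts."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total)
        for c in neighbor_counts.values()
        if c > 0
    )


def classify(final_count, intermediate_count, neighbor_counts):
    """Map a token to the green / yellow / red groups of Figure 1."""
    if fi_ratio(final_count, intermediate_count) >= FI_THRESHOLD:
        return "keep"       # green: frequent in the final tokenization
    if neighbor_entropy(neighbor_counts) >= ENTROPY_THRESHOLD:
        return "partial"    # yellow: meaningful partial token
    return "residue"        # red: intermediate merge residue


# Hypothetical tokens with made-up counts.
varied = Counter({i: 1 for i in range(32)})    # many distinct neighbors
fixed = Counter({"a": 99, "b": 1})             # almost always one neighbor
labels = [
    classify(900, 1000, Counter()),
    classify(10, 1000, varied),
    classify(10, 1000, fixed),
]
```

In this sketch a token is flagged as a residue only when both signals agree: it rarely survives to the final tokenization, and it almost always appears next to the same neighbors.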