BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

TL;DR

Picky BPE is a modified BPE algorithm that refines the vocabulary during tokenizer training by removing the intermediate “junk” tokens that merges leave behind. It either improves downstream performance or does not harm it.

Abstract

Language models can benefit greatly from efficient tokenization. However, most still use the classical BPE algorithm, a simple and reliable method that has nonetheless been shown to cause issues such as under-trained tokens and sub-optimal compression, which may affect downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce downstream performance, and in several cases improves it.

Paper Structure

This paper contains 27 sections, 2 equations, 5 figures, 22 tables, and 2 algorithms.

Figures (5)

  • Figure 1: An example of a series of merges that produces the token Kentucky. The pre-merge token frequencies are shown in the corresponding circles. In the vanilla BPE algorithm, entucky would also be stored in the vocabulary, even though it is redundant after the merge. In this example, the IoS metric effectively captures the intermediate token, as $\mathrm{IoS}(\texttt{entucky}) \geq \mathcal{T} = 0.9$.
  • Figure 2: Picky BPE tokenization example. Token frequencies are shown in the corresponding circles and are updated on each merge. The token "ould" is removed only after it has been merged into three common tokens that contain it. The corresponding IoS values are shown for every merge. Once IoS becomes greater than or equal to the threshold $\mathcal{T}$ (0.9 in this example), the token "ould" is removed.
  • Figure 3: Input embedding vectors for Picky BPE tokens with (a) $\mathcal{T} = 1$ and (b) $\mathcal{T} = 0.9$ for English vocabularies of size 16384 in EN--DE experiments with separate vocabularies. For each token we compute its probability in the training corpus (y-axis) and the L2 norm of its embedding vector in the trained model (x-axis).
  • Figure 4: Token frequency distributions for English vocabularies of size 16384 in EN--DE experiments with separate vocabularies for input and output. The left tail becomes less heavy as the threshold decreases.
  • Figure 5: Input embedding vectors for Picky BPE tokens with (a, c, e) $\mathcal{T} = 1.0$, (b) $\mathcal{T} = 0.8$, (d) $\mathcal{T} = 0.7$, and (f) $\mathcal{T} = 0.6$ for English vocabularies of size 16384 in EN--DE experiments with separate vocabularies. For each token we compute its probability in the training corpus (y-axis) and the L2 norm of its embedding vector in the trained model (x-axis).
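
The removal rule described in the captions for Figures 1 and 2 can be illustrated with a minimal sketch. It assumes that IoS is the fraction of a token's occurrences that take part in the merge, that is, the frequency of the merged pair divided by the token's own frequency just before the merge; this section does not state the definition, so treat it as an assumption. The function names and the counts are hypothetical and are not taken from the paper's figures.

```python
def intersection_over_self(pair_freq: int, token_freq: int) -> float:
    """Assumed IoS: share of a token's occurrences that fall inside the merged pair."""
    if token_freq <= 0:
        raise ValueError("token_freq must be positive")
    return pair_freq / token_freq


def should_remove(pair_freq: int, token_freq: int, threshold: float = 0.9) -> bool:
    """Flag a token as intermediate when its IoS reaches the threshold T."""
    return intersection_over_self(pair_freq, token_freq) >= threshold


# Hypothetical counts: the token occurs 100 times before the merge.
print(should_remove(pair_freq=95, token_freq=100))  # IoS = 0.95, at or above T = 0.9
print(should_remove(pair_freq=60, token_freq=100))  # IoS = 0.60, below T = 0.9
```

Under this reading, a threshold of $\mathcal{T} = 1$ flags a token only when every one of its occurrences is absorbed by the merge, while lower thresholds flag tokens that still occur occasionally on their own.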