Batching BPE Tokenization Merges

Alexander P. Morgan

TL;DR

The paper addresses the challenge of training BPE tokenizers on compute- and memory-constrained hardware. It introduces BatchBPE, a pure-Python implementation that batches merges, reducing the training cost from $Tokens \times VocabSize$ to $Tokens \times NumBatches$, and that relies on a chunk-frequency dictionary to process text efficiently. It investigates stop-word preprocessing and the filtering of rare chunks (freq_cutoff), and shows that these settings affect encoded text length only modestly, providing a practical framework for tokenization experiments. The work culminates in an open-source tool and the view that batching will become increasingly important as vocabularies grow, enabling researchers to prototype tokenization strategies on commodity hardware.

Abstract

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of the text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure-Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated by training several token vocabularies to explore the batch merging process and to experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. The resulting encoded lengths of texts are used as a basic evaluation metric.
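To make the two ideas in the abstract concrete, the following is a minimal sketch of training on a chunk-frequency dictionary with batched merges. It is not BatchBPE's actual code or API: the function names, the whitespace-based chunking regex, and the batching rule are all assumptions made for illustration. In particular, the rule used here (stop at the first candidate pair that shares a token with a pair already in the batch) is a deliberately conservative stand-in and may differ from the paper's safe-merge criteria. Only the name freq_cutoff is taken from the text above.

```python
import re
from collections import Counter

def chunk_counts(text, freq_cutoff=1):
    """Map each distinct chunk (a tuple of byte values) to its frequency.
    Chunks seen fewer than freq_cutoff times are dropped.
    The regex is a simplistic, assumed chunking rule."""
    counts = Counter(re.findall(r"\s?\S+", text))
    return {tuple(chunk.encode("utf-8")): n
            for chunk, n in counts.items() if n >= freq_cutoff}

def pair_counts(chunks):
    """Count adjacent token pairs, weighting each chunk by its frequency."""
    pairs = Counter()
    for ids, n in chunks.items():
        for pair in zip(ids, ids[1:]):
            pairs[pair] += n
    return pairs

def select_batch(pairs, batch_size):
    """Take the most frequent pairs until one shares a token with a pair
    already chosen (assumed conservative rule, not the paper's)."""
    batch, used = [], set()
    for (a, b), _ in pairs.most_common():
        if a in used or b in used:
            break
        batch.append((a, b))
        used.update((a, b))
        if len(batch) == batch_size:
            break
    return batch

def apply_batch(chunks, merges):
    """Apply every merge in the batch in a single pass over the chunks."""
    merged = Counter()
    for ids, n in chunks.items():
        out, i = [], 0
        while i < len(ids):
            pair = tuple(ids[i:i + 2])
            if pair in merges:
                out.append(merges[pair])
                i += 2
            else:
                out.append(ids[i])
                i += 1
        merged[tuple(out)] += n
    return dict(merged)

def train(text, vocab_size, batch_size=100, freq_cutoff=1):
    """Return a dict mapping merged pairs to new token ids (256 and up)."""
    chunks = chunk_counts(text, freq_cutoff)
    merges, next_id = {}, 256
    while next_id < vocab_size:
        pairs = pair_counts(chunks)
        if not pairs:
            break
        batch = select_batch(pairs, min(batch_size, vocab_size - next_id))
        new = {}
        for pair in batch:
            new[pair] = next_id
            next_id += 1
        chunks = apply_batch(chunks, new)
        merges.update(new)
    return merges

merges = train("the cat sat on the mat the cat", vocab_size=260, batch_size=2)
```

Each iteration of the loop in train makes one pass over the distinct chunks regardless of how many pairs the batch contains, which is where the saving from batching comes from. Because repeated chunks are collapsed into a single dictionary entry with a count, the pass touches far fewer tokens than the raw text contains.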

Paper Structure (15 sections, 2 equations, 5 figures)

Introduction
Text chunks as a power-law distribution
Stop word text chunks
Discarding uncommon text chunks
Batching token merges
Safe merges
Naive safe merges
Position-sensitive safe merges
Continue searching
Batch merging issues and solutions
Repeated token pairs
Tokenization experiments
Disambiguating stop words as prefixes
Filtering rare text chunks
Conclusion

Figures (5)

Figure 1: Only a small fraction of total text chunks processed are unique.
Figure 2: 75% of the unique text chunks in the FineWeb-Edu 10B sample dataset appear fewer than 4 times.
Figure 3: Several longer words in GPT2's token vocabulary begin with the token for " in".
Figure 4: Preprocessing a few stop words has a marginal impact on encoded length.
Figure 5: Removing rare words from a dataset generally has a slight adverse impact on encoded text length.
