Batching BPE Tokenization Merges

Alexander P. Morgan

TL;DR

The paper addresses the challenge of training BPE tokenizers on compute- and memory-constrained hardware. It introduces BatchBPE, a pure-Python implementation that batches merges, reducing the training cost from $Tokens \times VocabSize$ to $Tokens \times NumBatches$, and that relies on a chunk-frequency dictionary to process text efficiently. It investigates stop-word preprocessing and the filtering of rare chunks (freq_cutoff) and shows that these settings only modestly affect encoded text length, providing a practical framework for tokenization experiments. The work culminates in an open-source tool and the view that batching will become increasingly important as vocabularies grow, enabling researchers to prototype tokenization strategies on commodity hardware.

Abstract

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated through the training of several token vocabularies to explore the batch merging process and to experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. Resultant encoded lengths of texts are used as a basic evaluation metric.
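
The two ideas summarized above, counting pairs over a chunk-frequency dictionary and merging several pairs per pass, can be illustrated with a minimal Python sketch. This is not BatchBPE's code. The function names, the whitespace chunking, and the rule used to pick a batch (no token may appear in two pairs of the same batch) are assumptions made for illustration, and the sketch does not claim to reproduce the vocabulary that one-merge-at-a-time BPE would build.

```python
from collections import Counter


def chunk_frequencies(text, freq_cutoff=1):
    """Map each unique chunk (as a tuple of byte ids) to its count.

    Chunks are split on whitespace here (an assumption); chunks seen
    fewer than freq_cutoff times are dropped.
    """
    counts = Counter(text.split())
    return {
        tuple(chunk.encode("utf-8")): n
        for chunk, n in counts.items()
        if n >= freq_cutoff
    }


def pair_counts(chunks):
    """Count adjacent token pairs, weighting each chunk by its frequency."""
    counts = Counter()
    for ids, n in chunks.items():
        for pair in zip(ids, ids[1:]):
            counts[pair] += n
    return counts


def select_batch(counts, batch_size):
    """Pick frequent pairs that share no token (assumed safety rule)."""
    batch, used = [], set()
    for pair, _ in counts.most_common():
        if len(batch) == batch_size:
            break
        if used.isdisjoint(pair):
            batch.append(pair)
            used.update(pair)
    return batch


def apply_merges(ids, new_ids):
    """Replace every occurrence of a batched pair in one left-to-right scan."""
    out, i = [], 0
    while i < len(ids):
        pair = tuple(ids[i:i + 2])
        if pair in new_ids:
            out.append(new_ids[pair])
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return tuple(out)


def train(text, vocab_size, batch_size=2, freq_cutoff=1):
    """Return (merges, passes); each pass scans the unique chunks once."""
    chunks = chunk_frequencies(text, freq_cutoff)
    merges, next_id, passes = {}, 256, 0
    while next_id < vocab_size:
        limit = min(batch_size, vocab_size - next_id)
        batch = select_batch(pair_counts(chunks), limit)
        if not batch:
            break
        new_ids = {}
        for pair in batch:
            new_ids[pair] = next_id
            next_id += 1
        merges.update(new_ids)
        merged = Counter()
        for ids, n in chunks.items():
            merged[apply_merges(ids, new_ids)] += n
        chunks = dict(merged)
        passes += 1
    return merges, passes


sample = "the cat the hat the cat sat"
print(train(sample, vocab_size=260, batch_size=1)[1])  # 4 passes
print(train(sample, vocab_size=260, batch_size=2)[1])  # 2 passes
```

On the toy input, four merges take four passes over the data when done one at a time and two passes when batched in pairs, which is the effect behind the stated drop from $Tokens \times VocabSize$ to $Tokens \times NumBatches$. Because each pass iterates over unique chunks weighted by their counts, the work per pass also scales with the number of distinct chunks instead of the raw text length.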

Paper Structure (15 sections, 2 equations, 5 figures)


Figures (5)

  • Figure 1: Only a small fraction of total text chunks processed are unique.
  • Figure 2: 75% of the unique text chunks in the FineWeb-Edu 10B sample dataset appear fewer than 4 times.
  • Figure 3: Several longer words in GPT2's token vocabulary begin with the token for " in".
  • Figure 4: Preprocessing a few stop words has a marginal impact on encoded length.
  • Figure 5: Removing rare words from a dataset generally has a slight adverse impact on encoded text length.