Getting the most out of your tokenizer for pre-training and domain adaptation

Gautier Dagan; Gabriel Synnaeve; Baptiste Rozière

TL;DR

This work shows that tokenizer design (specifically vocabulary size, pre-tokenization, and training data) has a substantial impact on LLM generation speed, memory usage, effective context size, and downstream code-generation performance. Through large-scale ablations on 1.5B and 7B code models, it finds that specialized code tokenizers can markedly improve compression without hurting accuracy, and that changing a pre-trained model's tokenizer becomes effective when fine-tuning on more than 50 billion tokens. It provides practical guidelines for selecting the vocabulary size and the pre-tokenization scheme (favoring a GPT-4-like regular expression) and shows that vocabulary transfer or tokenizer extension can yield gains with minimal disruption once enough fine-tuning data is available. The findings encourage practitioners to treat tokenization as a domain-adaptation lever that enables faster inference and larger effective context sizes in code-oriented LLMs.

Abstract

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model's generation speed, effective context size, memory usage, and downstream performance. We train specialized Byte-Pair Encoding code tokenizers, and conduct extensive ablations on the impact of tokenizer design on the performance of LLMs for code generation tasks such as HumanEval and MBPP, and provide recommendations for tokenizer hyper-parameters selection and switching the tokenizer in a pre-trained LLM. We perform our experiments on models trained from scratch and from pre-trained models, verifying their applicability to a wide range of use-cases. We find that when fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.
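To make the role of the pre-tokenization regular expression concrete, here is a minimal sketch of how such a pattern splits text into chunks before BPE merges are learned inside each chunk. The pattern is a simplified, ASCII-only approximation of a GPT-2-style expression, not the exact pattern studied in the paper: the original relies on Unicode character classes that Python's standard `re` module does not support. The names `GPT2_LIKE` and `pre_tokenize` are hypothetical.

```python
import re

# Assumption: simplified ASCII-only stand-in for a GPT-2-style pattern.
# The real pattern uses Unicode classes (e.g. \p{L}, \p{N}) that need the
# third-party `regex` package; plain ASCII ranges are used here instead.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # English-specific contractions
    r"| ?[A-Za-z]+"              # optional leading space, then letters
    r"| ?[0-9]+"                 # optional leading space, then digits
    r"| ?[^\sA-Za-z0-9]+"        # optional leading space, then punctuation
    r"|\s+(?!\S)"                # whitespace run not followed by non-space
    r"|\s+"                      # any remaining whitespace
)


def pre_tokenize(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return GPT2_LIKE.findall(text)


if __name__ == "__main__":
    print(pre_tokenize("x = foo(1)"))
    print(pre_tokenize("don't"))
```

Because every character belongs to exactly one chunk, joining the chunks reproduces the input. Changing the pattern changes which character sequences can ever become a single token, which is why the choice of expression affects compression.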

Paper Structure (32 sections, 5 equations, 9 figures, 11 tables)

Introduction
Compression trade-offs
Compression metrics
Algorithm
Data
Pre-tokenization
Pre-tokenizers based on regular expressions.
Vocabulary Size
Optimal Vocabulary Size
Code Tokenizers Experiments
How much data?
Influence of tokenizer size
Tokenizer update methods
Vocabulary Transfer
Tokenizer Extension
...and 17 more sections

Figures (9)

Figure 1: Three ways to increase in-domain compression in a BPE tokenizer with their respective trade-offs.
Figure 2: Tokenizers trained with different percentages of code, English, and multilingual data. Unsurprisingly, training on code improves code compression, training on multilingual data improves multilingual compression, and training on an even mix of all three subsets leads to the best average compression.
Figure 3: The GPT-2 [gpt2] and GPT-4 [openai2023gpt4] pre-tokenization regular expressions decomposed into functional sub-parts, and another version dubbed Punct, which we introduce to ablate some of the changes introduced in GPT-4. Punct does away with the English-specific contractions and prevents certain whitespace and punctuation tokens such as \t or . from being encoded at the start of an alpha-only token (see the appendix for an example).
Figure 4: (top left) For a given fixed set of tokenizer settings, we measure the Code NSL of different vocabulary sizes. We set the reference point to the tokenizer trained at 32k tokens to compare against. (top middle) We measure the inference time for a set of vocabulary sizes and models with a fixed sequence length of 4096, and plot a linear regression over observations. We normalize predictions to a vocabulary of 32k. (top right) By combining the compression and inference time trade-offs, we obtain a simple cost function that describes an optimal inference time. (bottom) We use the equation labeled eq:eq1 to find the memory-optimal vocabulary size for different models. Llama 2 34B uses grouped-query attention, which significantly reduces the cache's memory usage and the memory-optimal vocabulary size.
Figure 5: Performance vs Code NSL. We plot the HumanEval Pass@1 performance against Code NSL for our 1.5B LLMs fine-tuned with different base models and tokenizers.
...and 4 more figures
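The captions above report compression as Code NSL against a reference tokenizer. The listing does not define the metric, so the sketch below assumes that NSL (normalized sequence length) is the total number of tokens a tokenizer produces on a corpus divided by the total produced by a reference tokenizer on the same corpus. The two toy tokenizers and the function name `nsl` are hypothetical stand-ins, not the tokenizers from the paper.

```python
def nsl(tokenize, reference_tokenize, documents):
    """Ratio of total token counts: candidate tokenizer over reference.

    Assumption: this is the intended reading of NSL. Values below 1.0 mean
    the candidate encodes the corpus in fewer tokens than the reference.
    """
    candidate_total = sum(len(tokenize(doc)) for doc in documents)
    reference_total = sum(len(reference_tokenize(doc)) for doc in documents)
    return candidate_total / reference_total


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def char_tokenizer(text):
    return list(text)       # one token per character


def whitespace_tokenizer(text):
    return text.split()     # one token per whitespace-separated word


if __name__ == "__main__":
    corpus = ["def add(a, b):", "return a + b"]
    print(nsl(whitespace_tokenizer, char_tokenizer, corpus))
```

Under this reading, a lower NSL on a fixed corpus means more source text fits into the same context window and fewer decoding steps are needed to generate the same text.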
