Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière
TL;DR
This work demonstrates that tokenizer design (specifically vocabulary size, pre-tokenization regular expression, and training data) has a substantial impact on an LLM's generation speed, memory usage, effective context size, and downstream code-generation performance. Through large-scale ablations on 1.5B and 7B code models, it shows that specialized code tokenizers can markedly improve compression without hurting accuracy, and that changing a pre-trained model's tokenizer becomes effective when fine-tuning on more than roughly 50B tokens. It provides practical guidelines for selecting the vocabulary size and pre-tokenization scheme (favoring a GPT-4-like regex) and finds that tokenizer transfer or extension can yield gains with minimal disruption given sufficient fine-tuning data. The findings encourage practitioners to treat tokenization as a domain-adaptation lever, enabling faster inference and larger effective context sizes in code-oriented LLMs.
Abstract
Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model's generation speed, effective context size, memory usage, and downstream performance. We train specialized Byte-Pair Encoding code tokenizers and conduct extensive ablations on the impact of tokenizer design on the performance of LLMs for code generation tasks such as HumanEval and MBPP. We provide recommendations for tokenizer hyper-parameter selection and for switching the tokenizer in a pre-trained LLM. We perform our experiments both on models trained from scratch and on pre-trained models, verifying that our findings apply to a wide range of use-cases. We find that when fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.
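To make the link between tokenizer size, pre-tokenization, and compression concrete, the following is a minimal sketch of a toy byte-level BPE trainer. It is not the authors' training setup: the `PRETOKENIZE` pattern is a simplified stand-in chosen for illustration (it is not the GPT-4 regular expression), and the names `pretokenize`, `train_bpe`, `encode`, and `bytes_per_token` are hypothetical helpers. It only illustrates that, on the same text, a larger merge table yields fewer tokens and therefore more bytes per token.

```python
import re
from collections import Counter

# Simplified pre-tokenization pattern (an assumption for illustration, NOT the
# GPT-4 pattern): identifier-like words with an optional leading space, digit
# runs, whitespace runs, or runs of any other characters.
PRETOKENIZE = re.compile(r" ?[A-Za-z_]+|\d+|\s+|[^\sA-Za-z_\d]+")


def pretokenize(text):
    """Split text into pieces; BPE merges never cross piece boundaries."""
    return PRETOKENIZE.findall(text)


def merge_pair(symbols, pair, merged):
    """Replace each left-to-right occurrence of `pair` in `symbols` by `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text`."""
    # Each distinct piece is stored once, as a tuple of single-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in piece.encode("utf-8"))
        for piece in pretokenize(text)
    )
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every piece is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[tuple(merge_pair(list(word), best, best[0] + best[1]))] += freq
        words = new_words
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the learned merges in training order."""
    tokens = []
    for piece in pretokenize(text):
        symbols = [bytes([b]) for b in piece.encode("utf-8")]
        for pair in merges:
            symbols = merge_pair(symbols, pair, pair[0] + pair[1])
        tokens.extend(symbols)
    return tokens


def bytes_per_token(text, merges):
    """Compression measure: average number of UTF-8 bytes covered by one token."""
    return len(text.encode("utf-8")) / len(encode(text, merges))


corpus = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n" * 4
small = train_bpe(corpus, 3)    # small merge table (small vocabulary)
large = train_bpe(corpus, 12)   # larger merge table (larger vocabulary)
print("bytes/token, 3 merges: ", round(bytes_per_token(corpus, small), 2))
print("bytes/token, 12 merges:", round(bytes_per_token(corpus, large), 2))
```

In this toy setting, a higher bytes-per-token value means the same context window, measured in tokens, covers more source text, and fewer forward passes are needed to generate a given string. Real tokenizers differ in many details, such as vocabulary construction, special tokens, and the exact pre-tokenization pattern.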
