Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung, Jeonghoon Kim
TL;DR
The paper investigates why larger tokenizer vocabularies improve language model performance. Through controlled scaling from 24K to 196K vocabulary entries and diagnostics that include Kolmogorov complexity upper bounds and loss-decomposition metrics, the authors show that bigger vocabularies reduce tokenized-text complexity but intensify frequency skew, with most of the training gains driven by the 2,500 most frequent words. The key mechanism is that lower loss on frequent words translates into lower global cross-entropy and better downstream transfer, an effect that persists across data quality and scales with model size. The findings advocate principled tokenizer–model co-design using complexity-based objectives and show that gains from vocabulary growth parallel those from parameter scaling. This work clarifies the role of frequency imbalance in pre-training dynamics and offers actionable guidance for efficient tokenizer design and scaling strategies.
Abstract
Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To find out, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging the vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.
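As an illustration of the word-level loss decomposition mentioned above, the following minimal Python sketch splits a corpus's total loss into the part attributable to the most frequent words and the part attributable to the rare tail. It is not the authors' implementation. The function name `decompose_loss_by_frequency`, its arguments, and the assumption that each word already carries a summed negative log-likelihood over its tokens are all hypothetical choices made for this example.

```python
from collections import Counter


def decompose_loss_by_frequency(words, word_losses, top_k=2500):
    """Split total loss into a frequent-word 'head' and a rare-word 'tail'.

    words       -- list of word strings, one per word occurrence in the corpus
    word_losses -- per-occurrence loss (assumed here to be the summed negative
                   log-likelihood of the tokens that make up that word)
    top_k       -- how many of the most frequent word types count as the head
    """
    if len(words) != len(word_losses):
        raise ValueError("words and word_losses must have the same length")
    if not words:
        raise ValueError("words must not be empty")

    # Rank word types by corpus frequency and keep the top_k as the head set.
    counts = Counter(words)
    head_types = {w for w, _ in counts.most_common(top_k)}

    head_loss = 0.0
    tail_loss = 0.0
    head_occurrences = 0
    for word, loss in zip(words, word_losses):
        if word in head_types:
            head_loss += loss
            head_occurrences += 1
        else:
            tail_loss += loss

    total_loss = head_loss + tail_loss
    return {
        "head_loss": head_loss,
        "tail_loss": tail_loss,
        # Fraction of word occurrences covered by the head set.
        "head_share_of_words": head_occurrences / len(words),
        # Fraction of the total loss contributed by the head set.
        "head_share_of_loss": head_loss / total_loss if total_loss else 0.0,
    }
```

Running this for models trained with different vocabulary sizes and comparing `head_loss` and `tail_loss` would show, under these assumptions, whether a drop in total loss comes from the frequent words, from the tail, or from both.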
