Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Woojin Chung, Jeonghoon Kim

TL;DR

The paper investigates why larger tokenizer vocabularies improve language model performance. Scaling the vocabulary from 24K to 196K under controlled conditions, and using diagnostics that include Kolmogorov complexity upper bounds and loss-decomposition metrics, the authors show that bigger vocabularies reduce tokenized-text complexity but intensify frequency skew, with most of the training gains driven by the 2,500 most frequent words. The key mechanism is that lower loss on frequent words translates into lower global cross-entropy and better downstream transfer, an effect that persists across data quality and scales with model size. The findings advocate principled tokenizer–model co-design using complexity-based objectives and show that gains from vocabulary growth parallel those from parameter scaling. This work clarifies the role of frequency imbalance in pre-training dynamics and offers actionable guidance for efficient tokenizer design and scaling strategies.

Abstract

Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. To this end, we perform a controlled study that scales the vocabulary of the language model from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text -- formalized via Kolmogorov complexity -- and show that larger vocabularies reduce this complexity. Above 24K, every common word is already tokenized as a single token, so enlarging vocabulary only deepens the relative token-frequency imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. The same frequent words cover roughly 75% of tokens in downstream benchmarks, so this training advantage transfers intact. We further show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language model scaling in pre-training.
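The word-level loss decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-token negative log-likelihoods and a word-to-token alignment are already available; the function names, the toy inputs, and the two-bucket split ("frequent" versus "other") are hypothetical and are not taken from the paper's code.

```python
def word_level_losses(words, token_counts, token_losses):
    """Sum per-token losses (in nats) into one loss per word occurrence.

    words        : words of the evaluated text, in order
    token_counts : number of tokens the tokenizer produced for each word
    token_losses : flat list of per-token negative log-likelihoods
    """
    if sum(token_counts) != len(token_losses):
        raise ValueError("token counts do not match the number of token losses")
    out, i = [], 0
    for word, n in zip(words, token_counts):
        # A word split into n tokens contributes the sum of its n token losses.
        out.append((word, sum(token_losses[i:i + n])))
        i += n
    return out


def decompose(word_losses, frequent_words):
    """Split the total loss into frequent-word and other-word contributions."""
    totals = {"frequent": 0.0, "other": 0.0}
    for word, loss in word_losses:
        key = "frequent" if word in frequent_words else "other"
        totals[key] += loss
    return totals


# Toy example (made-up numbers): "zyzzyva" is split into three tokens.
words = ["the", "cat", "sat", "the", "zyzzyva"]
token_counts = [1, 1, 1, 1, 3]
token_losses = [0.5, 2.0, 1.5, 0.25, 3.0, 2.0, 1.0]

per_word = word_level_losses(words, token_counts, token_losses)
totals = decompose(per_word, frequent_words={"the"})
```

Summing token losses within a word keeps the comparison fair across tokenizers: a word costs the same total number of nats to predict whether it is one token or several, so the two buckets always add up to the global loss.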

Paper Structure

This paper contains 35 sections, 5 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Panel (a) shows that increasing vocabulary size exacerbates relative token-frequency imbalance: enlarging the vocabulary introduces more rare tokens, pushing the relative token-frequency distribution further from a uniform distribution. Panel (b) reveals that a $24$K-vocabulary tokenizer already segments each of the $2,500$ most frequent words as a single token, regardless of dataset quality. This implies that further vocabulary growth offers no added benefit for estimating the probabilities of frequent words. Panel (c) shows that the $n$ most frequent words in FineWeb-Edu and OpenWebText largely overlap, highlighting the universality of frequent vocabulary across different datasets. The $2,500$ most frequent words in FineWeb-Edu and OpenWebText account for approximately $74.4\%$ and $75.5\%$ of each dataset, respectively.
  • Figure 2: Panel (a) illustrates that models with a larger vocabulary reduce loss on the $2,500$ most frequent words while increasing loss on the $20,000$ rarest words. Nevertheless, panel (b) shows that the global cross-entropy loss declines as vocabulary size increases, demonstrating that the gains from lower loss on frequent words outweigh the losses from poorer estimates of infrequent words. It further reveals that frequent words account for nearly $75\%$ of the total loss, while loss on infrequent words grows with vocabulary size as their conditional probabilities fall due to data sparsity. Models are pre-trained on $40\mathrm{B}$ bytes and evaluated on a disjoint $5\mathrm{B}$-byte split of FineWeb-Edu.
  • Figure 3: For an $85\mathrm{M}$ model trained on $30\mathrm{B}$ tokens, larger vocabularies reduce loss on the $2,500$ most frequent words while increasing loss on the $20,000$ rarest words; since frequent words dominate, global cross-entropy drops (panels (a) and (b)). A $450\mathrm{M}$ model trained on $10\mathrm{B}$ tokens mirrors the pattern (panels (c) and (d)), indicating that these vocabulary-size effects persist across larger datasets and models.
  • Figure 4: Panel (a) demonstrates that the $2,500$ most frequent words in FineWeb-Edu make up roughly $72$ to $78\%$ of the tokens in downstream benchmark datasets as well as in CC-Main-2023-40 (Huang et al., 2024). ARC refers to ARC-Easy, and HS refers to HellaSwag. Panel (b) illustrates that a larger vocabulary reduces average per-word loss on frequent FineWeb-Edu words within the CC dataset, and shows how this translates into lower global cross-entropy loss on the CC dataset. Panel (c) confirms that scaling the vocabulary size boosts downstream task performance.
  • Figure 5: Panel (a) illustrates that increasing model size reduces loss on high-frequency words, and that the global cross-entropy loss of larger models is overwhelmingly driven by frequent-word losses, mirroring the effect of increased vocabulary size. However, unlike the pattern in Figure 2(a), scaling up model size does not exacerbate errors on infrequent tokens. Panel (b) demonstrates that the global cross-entropy loss declines as model size increases, showing the same tendency as scaling up the vocabulary size (Figure 2(b)).
  • ...and 3 more figures
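
The two corpus statistics that the captions rely on, the share of text covered by the $n$ most frequent words and the distance of the token-frequency distribution from uniform, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the KL divergence to the uniform distribution is used here as one plausible imbalance measure (the captions do not name the metric), and the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def top_n_coverage(words, n):
    """Fraction of word occurrences covered by the n most frequent words."""
    if not words:
        raise ValueError("words must be non-empty")
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(words)


def kl_from_uniform(token_ids, vocab_size):
    """KL(p || uniform) in nats, where p is the empirical token distribution.

    Equals log(vocab_size) - H(p); it is 0 for a perfectly uniform
    distribution and grows as frequency mass concentrates on few tokens.
    Tokens that never occur contribute nothing to the entropy term.
    """
    if not token_ids:
        raise ValueError("token_ids must be non-empty")
    total = len(token_ids)
    counts = Counter(token_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.log(vocab_size) - entropy


# Toy inputs (made up for illustration).
sample_words = "the cat sat on the mat the end".split()
coverage_top1 = top_n_coverage(sample_words, 1)      # "the" covers 3 of 8 words
skew_uniform = kl_from_uniform([0, 1, 2, 3], 4)      # every token used equally
skew_peaked = kl_from_uniform([0, 0, 0, 0], 4)       # one token takes all mass
```

On a real corpus, `top_n_coverage(words, 2500)` would correspond to the coverage figures quoted in the captions, and comparing `kl_from_uniform` across tokenizers of different vocabulary sizes would give one way to quantify how the imbalance changes.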