Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Chaofan Tao; Qian Liu; Longxu Dou; Niklas Muennighoff; Zhongwei Wan; Ping Luo; Min Lin; Ngai Wong

TL;DR

The paper reveals that vocabulary size meaningfully shapes LLM scaling laws and is often under-allocated in practice. It introduces a normalized unigram-based loss to enable fair cross-V comparisons and presents three complementary methods—IsoFLOPs power-law fitting, derivative-based optimization, and a parametric loss model—to predict the compute-optimal vocabulary. Empirical validation on 3B-parameter models shows that using the predicted optimal vocabulary improves downstream performance under the same FLOPs budget, with concrete gains demonstrated on tasks like ARC-Challenge when increasing V from 32K to 43K. The work underscores the necessity of jointly optimizing vocabulary, non-vocabulary parameters, and training data, and provides practical tools and predictions applicable to larger models and data regimes.

Abstract

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.

Paper Structure (51 sections, 17 equations, 14 figures, 5 tables)

Introduction
Preliminary
Scaling law
Scaling law with vocabulary
Analysis: Why the optimal vocabulary size is bounded by compute
Estimating the optimal vocabulary size
Approach 1: Estimating power laws via IsoFLOPs
Approach 2: Derivative-based fast estimation
Approach 3: Parametric fit of loss formula
Discussion
Predicting allocations for larger models
Experiments with scarce and excessive training data
Related work
Language models
Scaling laws
...and 36 more sections

Figures (14)

Figure 1: The relationship between non-vocabulary parameters $N_{\rm nv}$ and the corresponding optimal vocabulary parameters $N_{\rm v}^{\rm opt}$ follows a power law, where $N_{\rm v}^{\rm opt}$ should be scaled more slowly than $N_{\rm nv}$ because $\gamma < 1$. Empirical results align with the predictions of our proposed approaches, with larger circles indicating higher loss values. Here $V$ refers to the vocabulary size, i.e., the number of distinct tokens.
Figure 2: Vocabulary parameters of popular LLMs and predicted optimal vocabulary parameters at a compute-optimal number of training tokens. Most current LLMs have suboptimal vocabulary parameters because their vocabulary sizes are smaller than the predicted optimal values. Among the current models, StarCoder2-3B, OLMo-7B, InternLM2-20B, and Gemma2-27B have vocabulary sizes that come closest to the optimal allocation for their respective model sizes.
Figure 3: Left: FLOPs curve with various vocabulary sizes, assuming all configurations achieve a fixed loss. There exists an optimal vocabulary size that minimizes FLOPs. Right: Loss curves with various vocabulary sizes given different FLOP budgets. For each budget there exists an optimal vocabulary size that minimizes loss. As the FLOP budget increases this optimal vocabulary size increases (shifts to the right).
Figure 4: Training curves of the experiments used in Approach 1 (IsoFLOPs) and Approach 3 (parametric fit). We train models with the non-vocabulary parameters fixed and vocabulary sizes varying from 4K to 96K.
Figure 5: Fitting results of Approach 1. Blue stars denote the selected data points where the combination ($N_{\rm nv}$, $N_{\rm v}$, $H$) reaches the lowest loss given various FLOPs budgets. We find power law fits with respect to the optimal non-vocabulary parameters, vocabulary parameters, and the number of training characters, respectively.
...and 9 more figures
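
The power-law relationship described in the Figure 1 caption, $N_{\rm v}^{\rm opt} \propto N_{\rm nv}^{\gamma}$ with $\gamma < 1$, can be illustrated with a minimal sketch. The code below is not the paper's fitting procedure: it fits a straight line in log-log space by ordinary least squares, and the data points, the coefficient `ASSUMED_K`, and the exponent `ASSUMED_GAMMA` are synthetic placeholders chosen for illustration, not values reported by the authors.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**gamma in log-log space.

    Returns (k, gamma). Taking logs turns the power law into the line
    log y = log k + gamma * log x, so gamma is the slope.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    gamma = s_xy / s_xx
    k = math.exp(mean_y - gamma * mean_x)
    return k, gamma


# Synthetic, illustrative values only (NOT measurements from the paper).
ASSUMED_K = 2.0      # hypothetical coefficient
ASSUMED_GAMMA = 0.8  # hypothetical exponent; any value below 1 shows the same effect

n_nv = [3.3e7, 8.5e7, 3.0e8, 1.1e9, 3.0e9]           # non-vocabulary parameters
n_v_opt = [ASSUMED_K * x ** ASSUMED_GAMMA for x in n_nv]  # noiseless power law

k, gamma = fit_power_law(n_nv, n_v_opt)

# With gamma < 1, scaling N_nv by 10x scales N_v_opt by less than 10x.
growth_per_10x = 10 ** gamma
print(f"fitted gamma = {gamma:.3f}, growth per 10x in N_nv = {growth_per_10x:.2f}x")
```

Because the synthetic points lie exactly on a power law, the fit recovers the assumed exponent. With real measurements the points would scatter around the line and the fitted $\gamma$ would carry uncertainty.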
