Large Vocabulary Size Improves Large Language Models
Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato
TL;DR
The paper investigates how subword vocabulary size affects monolingual LLM quality, focusing on the embedding and output parameter budgets that scale with the vocabulary size $|V|$. Using SentencePiece unigram tokenization, it constructs vocabularies of $5k$, $10k$, $50k$, $100k$, and $500k$ subwords for English and Japanese, and trains Transformer models under two regimes to evaluate downstream performance. For continual training, it proposes a simple vocabulary-reconstruction approach that maps the original embeddings to the new vocabulary via $E_{new} = \frac{W E_{orig}}{\sqrt{|V_{orig}|}}$ and selectively inserts the original embeddings for tokens shared by both vocabularies; this approach outperforms simply reusing the original vocabulary. The findings demonstrate that vocabulary design is a practical lever for improving monolingual LLM performance, informing both initial training and continual adaptation, though the study is limited to two languages and a maximum vocabulary size of $500k$.
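
To make the reconstruction step more concrete, below is a minimal sketch of how the quoted mapping and the selective insertion for overlapping tokens could be applied. The function name, the construction of $W$ (a uniform placeholder here), and the use of PyTorch are illustrative assumptions, not the paper's exact procedure.

```python
import torch


def reconstruct_embeddings(E_orig, orig_vocab, new_vocab, W=None):
    """Sketch of vocabulary reconstruction for continual training.

    E_orig:     (|V_orig|, d) embedding matrix of the pre-trained model.
    orig_vocab: token -> row index mapping for the original vocabulary.
    new_vocab:  token -> row index mapping for the new vocabulary.
    W:          (|V_new|, |V_orig|) mapping matrix; assumed here, not
                specified by the summary above.
    """
    V_orig = len(orig_vocab)
    V_new = len(new_vocab)

    if W is None:
        # Placeholder assumption: map every new token uniformly from all
        # original embeddings when no mapping matrix is provided.
        W = torch.full((V_new, V_orig), 1.0 / V_orig)

    # E_new = W E_orig / sqrt(|V_orig|), the mapping quoted in the TL;DR.
    E_new = (W @ E_orig) / (V_orig ** 0.5)

    # Selective insertion: tokens that appear in both vocabularies keep
    # their original embeddings instead of the mapped ones.
    for token, new_idx in new_vocab.items():
        if token in orig_vocab:
            E_new[new_idx] = E_orig[orig_vocab[token]]

    return E_new
```

Under these assumptions, the overlapping tokens carry over their pre-trained representations unchanged, while the remaining new tokens start from a scaled projection of the original embedding matrix rather than from random initialization.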
Abstract
This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that a model using the new vocabulary outperforms the model that keeps the vocabulary used in pre-training.
