Large Vocabulary Size Improves Large Language Models
Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato
TL;DR
The paper investigates how subword vocabulary size affects monolingual LLM quality, focusing on the embedding and output parameter budgets that scale with the vocabulary size $|V|$. Using SentencePiece unigram tokenization, it constructs vocabularies of $5k$, $10k$, $50k$, $100k$, and $500k$ subwords for English and Japanese, and trains Transformer models under two regimes to evaluate downstream performance. For continual training, it proposes a simple vocabulary-reconstruction approach that maps the original embeddings to the new vocabulary via $E_{new} = \frac{W E_{orig}}{\sqrt{|V_{orig}|}}$ and selectively inserts the original embeddings for tokens shared by both vocabularies; this approach outperforms simply reusing the original vocabulary. The findings demonstrate that vocabulary design is a practical lever for improving monolingual LLM performance, informing both initial training and continual adaptation, though the study is limited to two languages and a maximum vocabulary size of $500k$.
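
To make the reconstruction step more concrete, below is a minimal sketch of how the quoted mapping and the selective insertion for overlapping tokens could be applied. The function name, the construction of $W$ (a uniform placeholder here), and the use of PyTorch are illustrative assumptions, not the paper's exact procedure.

```python
import torch


def reconstruct_embeddings(E_orig, orig_vocab, new_vocab, W=None):
    """Sketch of vocabulary reconstruction for continual training.

    E_orig:     (|V_orig|, d) embedding matrix of the pre-trained model.
    orig_vocab: token -> row index mapping for the original vocabulary.
    new_vocab:  token -> row index mapping for the new vocabulary.
    W:          (|V_new|, |V_orig|) mapping matrix; assumed here, not
                specified by the summary above.
    """
    V_orig = len(orig_vocab)
    V_new = len(new_vocab)

    if W is None:
        # Placeholder assumption: map every new token uniformly from all
        # original embeddings when no mapping matrix is provided.
        W = torch.full((V_new, V_orig), 1.0 / V_orig)

    # E_new = W E_orig / sqrt(|V_orig|), the mapping quoted in the TL;DR.
    E_new = (W @ E_orig) / (V_orig ** 0.5)

    # Selective insertion: tokens that appear in both vocabularies keep
    # their original embeddings instead of the mapped ones.
    for token, new_idx in new_vocab.items():
        if token in orig_vocab:
            E_new[new_idx] = E_orig[orig_vocab[token]]

    return E_new
```

Under these assumptions, the overlapping tokens carry over their pre-trained representations unchanged, while the remaining new tokens start from a scaled projection of the original embedding matrix rather than from random initialization.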
Abstract
This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that a model using the new vocabulary outperforms the model that keeps the vocabulary used in pre-training.
