Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

TL;DR

This work shows that decoupling input and output vocabularies and scaling the input vocabulary through multi-gram embeddings can substantially improve language model performance across model sizes. The authors introduce Over-Encoding (OE), Over-Decoding (OD), and the integrated Over-Tokenized Transformer (OT), and demonstrate a log-linear relationship between input vocabulary size and training loss. Empirical results across dense and MoE models, plus ablations on vocabulary design, show that larger input vocabularies yield consistent gains while large output vocabularies can hinder smaller models, which makes tokenizer design a critical scaling factor. The practical contribution includes efficient embedding parameterization and engineering strategies that keep overhead minimal.

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
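
As a rough illustration of what "scaling up input vocabularies to leverage multi-gram tokens" could look like, the sketch below builds an input embedding by summing a 1-gram lookup with a 2-gram lookup. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `ngram_ids` and `over_encode`, the padding with token id 0, the way an n-gram id is composed from token ids, and the modulo folding into a fixed-size table are all hypothetical choices made for this example.

```python
import numpy as np

def ngram_ids(tokens, n, base_vocab, table_size):
    """Return one id per position for the n-gram ending at that position.

    Assumption: the n-gram id is a base-`base_vocab` number built from the
    current and previous tokens, folded into `table_size` slots by modulo.
    """
    ids = []
    for i in range(len(tokens)):
        gram_id = 0
        for k in range(n):
            j = i - k
            # Assumption: positions before the sequence start use token id 0.
            tok = tokens[j] if j >= 0 else 0
            gram_id += tok * (base_vocab ** k)
        ids.append(gram_id % table_size)
    return ids

def over_encode(tokens, tables, base_vocab):
    """Sum the 1-gram .. n-gram embeddings; tables[k] holds (k+1)-gram rows."""
    out = np.zeros((len(tokens), tables[0].shape[1]))
    for k, table in enumerate(tables):
        ids = ngram_ids(tokens, k + 1, base_vocab, table.shape[0])
        out += table[ids]
    return out

rng = np.random.default_rng(0)
base_vocab, dim = 100, 8
tables = [
    rng.normal(size=(base_vocab, dim)),  # ordinary 1-gram embedding table
    rng.normal(size=(1000, dim)),        # 2-gram table, far smaller than 100**2
]
tokens = [5, 17, 42]
x = over_encode(tokens, tables, base_vocab)
print(x.shape)
```

Only the input side changes in this sketch: the output vocabulary and the prediction head are untouched, which is the sense in which input and output vocabularies are decoupled.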

Paper Structure

This paper contains 48 sections, 9 equations, 14 figures, 10 tables, 2 algorithms.

Figures (14)

  • Figure 1: Scaling trend for Over-Encoded models and baselines on OLMo2. We plot the loss after training on 400B tokens. For over-encoding, the input vocabulary size is extended from 0.1 million to 1.2 million and 12.8 million ($12\times$ and $128\times$ larger than the baseline), referred to as OE-1.2M and OE-12.8M. We observe that OE-12.8M with 400M parameters matches the baseline with 1B parameters.
  • Figure 2: Performance comparison for models trained on CFG data. The left panel compares 1-gram and 3-gram tokenizers, showing that 3-gram improves larger (85M parameters) models but harms smaller (2.4M parameters) ones. The right panel examines 3-gram usage in encoders and decoders, revealing consistent gains with 3-gram encoders regardless of model size, while 3-gram decoders degrade performance in smaller models.
  • Figure 3: Illustration of a GPT with 2-gram encoding/decoding. Note that 2-gram decoding keeps only the predicted next token, even though the next 2 tokens are predicted, which keeps inference cost identical to the vanilla model.
  • Figure 4: Training curves for OE-12.8M and baseline model on OLMo2-1B. The metrics are smoothed via exponential moving average with weight 0.99 for loss and 0.9 for downstream tasks. We observe significant convergence acceleration for the OE model: $5.7\times$ on loss, $3.2\times$ on MMLU-Var, $3.0\times$ on Hellaswag, $2.6\times$ on ARC-Challenge, $3.1\times$ on ARC-Easy and $3.9\times$ on PIQA.
  • Figure 5: A log-linear relationship is observed between vocabulary size $m$ and training loss $\mathcal{L}$, i.e. $\mathcal{L}=2.6754-0.0256 \times\log_{10}{m}$. The values are collected after training OLMoE-1.3B models on 500B tokens.
  • ...and 9 more figures
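
The fitted line in the Figure 5 caption can be read numerically with a few lines of Python. The sketch below only evaluates the quoted formula; it assumes that $m$ is the raw count of input vocabulary entries (so 12.8 million is `12.8e6`), and the helper names `predicted_loss` and `loss_drop` are made up for this example. The fit is reported for OLMoE-1.3B at 500B tokens, so applying it to other models or token budgets would be an extrapolation.

```python
import math

# Coefficients quoted in the Figure 5 caption: L = 2.6754 - 0.0256 * log10(m)
INTERCEPT = 2.6754
SLOPE = 0.0256

def predicted_loss(m):
    """Training loss predicted by the fitted line for input vocabulary size m."""
    return INTERCEPT - SLOPE * math.log10(m)

def loss_drop(m_small, m_large):
    """Predicted loss reduction when growing the vocabulary from m_small to m_large."""
    return SLOPE * math.log10(m_large / m_small)

for m in (1.2e6, 12.8e6):
    print(f"m = {m:.3g}: predicted loss = {predicted_loss(m):.4f}")

# Under this fit, every tenfold increase in m lowers the loss by the slope.
print(f"drop per 10x vocabulary: {loss_drop(1e5, 1e6):.4f}")
```

Because the relationship is linear in $\log_{10} m$, the predicted gain depends only on the ratio between two vocabulary sizes, not on their absolute values.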