How does a Language-Specific Tokenizer affect LLMs?
Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin
TL;DR
The paper examines how language-specific tokenizers influence LLM behavior, using Korean as a non-English case study. It builds a Korean-specific extended tokenizer by augmenting a SentencePiece BPE vocabulary, then measures the intrinsic effects with a Next Token Prediction framework that varies task difficulty and target unit. The main findings are that the extended tokenizer reduces confidence in incorrect predictions and lowers cross-entropy on complex tasks, which indicates more stable and sensible generation with potential downstream benefits. The work contributes an intrinsic tokenizer analysis framework and empirical evidence in favor of language-specific tokenizer extensions, informing tokenizer design for multilingual generation and model explainability.
Abstract
Language-specific tokenizers intuitively appear crucial for effective natural language processing, yet empirical analyses of their significance and of the underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models trained predominantly on English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments that compare models using the basic tokenizer and the extended tokenizer on various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce fewer nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
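The two quantities the abstract reports, cross-entropy and confidence in incorrect predictions, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `ntp_metrics`, the toy logits, and the reading of "confidence" as the softmax probability of the top-ranked token are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ntp_metrics(logits_per_step, gold_ids):
    """Hypothetical helper for a Next Token Prediction evaluation.

    Returns a pair:
      - mean cross-entropy of the gold token over all steps
      - mean probability assigned to the top-ranked token on the steps
        where that token is wrong (None if every prediction is correct)

    Assumption: "confidence" means the top softmax probability.
    """
    ce_sum = 0.0
    wrong_conf = []
    for logits, gold in zip(logits_per_step, gold_ids):
        probs = softmax(logits)
        ce_sum += -math.log(probs[gold])
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != gold:
            wrong_conf.append(probs[pred])
    mean_ce = ce_sum / len(gold_ids)
    mean_wrong = sum(wrong_conf) / len(wrong_conf) if wrong_conf else None
    return mean_ce, mean_wrong


# Toy example with a two-token vocabulary: the first step is predicted
# correctly, the second is a confident mistake.
mean_ce, mean_wrong = ntp_metrics([[0.0, 0.0], [2.0, 0.0]], [0, 1])
print(mean_ce, mean_wrong)
```

Under this reading, a lower second value means the model is less certain when it is wrong, which is the behavior the abstract attributes to the extended tokenizer.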
