How does a Language-Specific Tokenizer affect LLMs?

Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin

TL;DR

The paper tackles how language-specific tokenizers influence LLM behavior, focusing on Korean as a non-English case study. It builds a Korean-specific extended tokenizer by augmenting a SentencePiece BPE vocabulary and demonstrates intrinsic effects using a Next Token Prediction framework across varying difficulties and target units. The main findings show that extended tokenizers reduce confidence in incorrect predictions and lower cross-entropy in complex tasks, indicating more stable and sensible generation with potential downstream benefits. This work contributes an intrinsic tokenizer analysis framework and empirical evidence supporting language-specific tokenizer extensions, informing tokenizer design for improved multilingual generation and model explainability.

Abstract

The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
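
To make the two quantities named in the abstract concrete, the following is a minimal sketch of how cross-entropy on the target token and confidence in an incorrect prediction could be computed from a model's next-token logits. The function names, the toy logits, and the choice to define confidence as the softmax probability of the top-ranked token are assumptions made for illustration; the paper's own metric definitions (for example, its normalized confidence level) may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_metrics(logits, target_id):
    """Summarize one next-token prediction (illustrative definitions only)."""
    probs = softmax(logits)
    predicted_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "correct": predicted_id == target_id,
        # Assumed definition: probability mass placed on the top-ranked token.
        "confidence": probs[predicted_id],
        # Cross-entropy of the target token under the predicted distribution.
        "cross_entropy": -math.log(probs[target_id]),
    }

# Hypothetical logits over a 4-token vocabulary; the target is token 1,
# but both models rank token 0 first, so both predictions are incorrect.
confident_wrong = ntp_metrics([4.0, 0.5, 0.2, 0.1], target_id=1)
hesitant_wrong = ntp_metrics([1.2, 1.0, 0.2, 0.1], target_id=1)
```

In this toy case both predictions are wrong, but the second distribution is less confident in its wrong answer and assigns more probability to the target, so its cross-entropy is lower. That is the pattern the abstract attributes to the extended tokenizer.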

Paper Structure

This paper contains 17 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: An example comparing how the Llama-2 base tokenizer and our extended tokenizer process the same sentence. The example sentence consists of 13 characters, including the special token <s> and whitespace. The base tokenizer segments 3 of these characters into bytes, resulting in a total of 19 tokens. In contrast, our extended tokenizer appropriately tokenizes the sentence into meaningful units of subwords, resulting in only 8 tokens.
  • Figure 2: Examples of the NTP task input in the easy and hard versions, respectively. In the easy version, the answer is already provided to the model since the input includes the original test sentence. In contrast, in the hard version, only the sequence preceding the target is provided, making it more difficult for the model to predict the correct token.
  • Figure 3: Example of the NTP task input in the three different target units. At the token level, the target is tokenized into a single token by both tokenizers. At the character level, the target is segmented into 3 tokens by the base tokenizer but is treated as a single token by the extended tokenizer. At the word level, both tokenizers split the target into multiple tokens.
  • Figure 4: Accuracy
  • Figure 5: Normalized Confidence Level
  • ...and 4 more figures
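
The token counts in the Figure 1 caption are consistent with byte fallback: a character missing from the vocabulary is emitted as one token per UTF-8 byte, and a Hangul syllable occupies three bytes. If the other 10 of the 13 characters each map to a single token (an inference from the caption, not something it states), the total is 10 + 3 × 3 = 19. The sketch below reproduces that arithmetic with a character-level toy; the vocabulary, the example string, and the function `count_tokens_with_byte_fallback` are hypothetical and do not reproduce the Llama-2 tokenizer, which also merges characters into subwords.

```python
def count_tokens_with_byte_fallback(text, vocab):
    """Character-level toy: one token per in-vocabulary character,
    one token per UTF-8 byte for every other character."""
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            total += len(ch.encode("utf-8"))
    return total

# Hypothetical 4-syllable string; only two syllables are in the toy vocabulary.
text = "한국어는"
toy_vocab = {"한", "국"}
n_tokens = count_tokens_with_byte_fallback(text, toy_vocab)

# Arithmetic implied by the Figure 1 caption, under the assumption above:
# 10 single-token characters plus 3 characters split into 3 bytes each.
bytes_per_syllable = len("어".encode("utf-8"))
figure1_total = 10 * 1 + 3 * bytes_per_syllable
```

Adding the missing syllables to the vocabulary, as an extended tokenizer would, brings the toy count down from 8 tokens to 4, one per character.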