
Accelerating Multilingual Language Model for Excessively Tokenized Languages

Jimin Hong, Gibbeum Lee, Jaewoong Cho

TL;DR

This work tackles the tokenization inefficiency of English-centric multilingual LLMs when handling languages written in non-Roman scripts. It introduces MuMo, a framework that appends a Target Monolingual LM Head to a frozen pretrained multilingual model, concatenating its outputs to predict next tokens, and trains only this head on a small target-language corpus. Inference is a two-step process: the new head selects top-k candidates, and the base model then verifies them. MuMo achieves a speedup of approximately 1.7x on Korean and Japanese while preserving generation quality on summarization and translation tasks. The approach is data-efficient, avoids full pretraining, and offers a practical path to faster generation for languages with heavy token fragmentation. The evaluation is limited to a small set of languages and model sizes; extending it to more languages and larger models is left to future work.

Abstract

Recent advancements in large language models (LLMs) have markedly improved performance on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often excessively fragment text into character-level or Unicode-level tokens in non-Roman alphabetic languages, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach adds to a pre-trained LLM a new language model head whose vocabulary is tailored to a specific target language. The new head is then fine-tuned, and a verification step is incorporated to ensure that the model's performance is preserved. We show that this targeted fine-tuning, with all other model parameters frozen, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks.
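
To make the fragmentation problem concrete, the following minimal sketch counts how many generation steps the Korean phrase from Figure 2 would need under three idealized granularities: target-language units, Unicode characters, and UTF-8 bytes. It does not run the tokenizer used in the paper; the function name `fragmentation_report` and the three granularities are assumptions chosen for illustration only.

```python
import unicodedata


def fragmentation_report(phrase, target_units):
    """Count steps needed to emit `phrase` at three idealized granularities.

    This is an illustration, not the tokenizer of any real model.
    """
    # Normalize so that each Hangul syllable is a single code point.
    text = unicodedata.normalize("NFC", phrase)
    units = [unicodedata.normalize("NFC", u) for u in target_units]
    if "".join(units) != text:
        raise ValueError("target_units must concatenate to the phrase")
    return {
        "target_units": len(units),               # one step per target-vocabulary unit
        "chars": len(text),                       # one step per Unicode character
        "utf8_bytes": len(text.encode("utf-8")),  # one step per byte (byte fallback)
    }


# Phrase and morpheme split taken from the Figure 2 caption.
report = fragmentation_report("태양으로부터", ["태양", "으로", "부터"])
print(report)
```

For this phrase the sketch yields 3 target-language units, 6 characters, and 18 UTF-8 bytes. The 12 steps reported in Figure 2 for the pre-trained multilingual model fall between the character-level and byte-level counts.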

Paper Structure (44 sections, 5 equations, 4 figures, 17 tables)


Figures (4)

  • Figure 1: Analysis of tokenization lengths and language distribution in the pretraining corpus, for languages with a share >= 0.04%. English script comprises 89.7% of the corpus and has an average token length of 29.6 in FLoRes-200. The languages using the Chinese, Japanese, and Korean (CJK) scripts have longer tokenization lengths than those using Latin and Cyrillic scripts. Our primary focus is on languages that are excessively tokenized by English-centric tokenizers.
  • Figure 2: Overview of the proposed framework. Illustration of (Left) generation with a pre-trained multilingual model and (Right) generation with the MuMo framework. Given the Korean prefix "천왕성은" (Uranus is), the model generates the following phrase "태양으로부터" (from the Sun), which consists of 3 morphemes ("태양", "으로", "부터") in Korean. Generation with the pre-trained multilingual model is inefficient because of excessive fragmentation: it requires 12 steps to generate only 3 Korean morphemes. The MuMo framework instead lets the multilingual language model generate multiple tokens in a single iteration by selecting a word from the Korean vocabulary, so it requires only 3 steps.
  • Figure 3: Illustration of a single-step prediction with MuMo. First, the MuMo LM Head $f_{\text{mumo}}$ selects the top 6 candidates. Then, the pre-trained multilingual model verifies the feasibility of the candidates. Among the modules in MuMo, only the Target Monolingual LM Head (the Korean LM Head in the figure) is trained.
  • Figure 4: Evaluation on multiple tasks after training on the QA and summarization tasks. The red dotted lines represent the average single-answer grading score of the instruction-tuned multilingual language model. The decline is less pronounced with MuMo, suggesting that it is relatively effective at preserving the model's multi-task proficiency.
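
The single-step prediction described in the Figure 3 caption (top-k selection by the new head, then verification by the base model) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the names `top_k` and `mumo_step`, the made-up scores, the use of a candidate's first character as a stand-in for its first base-model token, and the rule of keeping the candidate that the base model scores highest are all hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring keys, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def mumo_step(head_scores, base_logprob, k=6):
    """One hypothetical prediction step: select candidates, then verify.

    head_scores:  score per target-vocabulary word from the new LM head.
    base_logprob: base-model log-probability per first piece of a word.
                  ASSUMPTION: the first character stands in for the first
                  base-model token of each candidate.
    """
    # Step 1: the target-language head proposes k candidate words.
    candidates = top_k(head_scores, k)
    # Step 2 (assumed rule): keep the candidate whose first piece the
    # frozen base model considers most likely.
    return max(candidates, key=lambda word: base_logprob[word[0]])


# Toy scores, invented for illustration only.
head_scores = {
    "지구": 2.5, "태양으로부터": 2.0, "매우": 1.0, "가장": 0.5,
    "달": 0.2, "별": 0.1, "물": -1.0,
}
base_logprob = {
    "지": -2.0, "태": -0.5, "매": -3.0, "가": -3.5,
    "달": -4.0, "별": -4.5, "물": -0.1,
}

chosen = mumo_step(head_scores, base_logprob, k=6)
print(chosen)
```

In this toy setup the head alone would pick "지구", but verification switches the choice to "태양으로부터" because the base model prefers its first piece. The word "물" is never considered, even though the base model scores it highly, because it is not among the head's top 6 candidates.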