Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer

Muhammad Dehan Al Kautsar, Fajri Koto

TL;DR

This work tackles cross-lingual transfer bottlenecks rooted in tokenization by introducing parallel tokenizers that align vocabulary across languages via a monolingual pivot and word-level mapping, with language identity cues to preserve language-specificity. The authors construct parallel vocabularies by translating word-type tokens, expanding with monolingual vocabularies, and concatenating while capping size, achieving substantial cross-language alignment and improved fertility characteristics. Pretraining from scratch on 13 low-resource languages shows that parallel tokenizers consistently outperform traditional baselines on sentiment, hate speech, emotion classification, and cross-lingual sentence representations, with pronounced gains under limited target-language data and stronger cross-lingual coherence after continual pretraining. The results imply that tokenization design is a critical lever for multilingual representation learning, enabling more effective cross-lingual transfer and scalable language addition in low-resource contexts.
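The construction steps summarized above (translate word-type tokens from the pivot vocabulary, expand with the target language's monolingual vocabulary, then cap the size) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_parallel_vocab`, the special-token list, the `[UNUSED_i]` placeholder for pivot words without a usable translation, and the toy dictionaries are all hypothetical. The English/Hausa words come from the example in the abstract, with English assumed to be the pivot.

```python
from typing import Dict, List

# Hypothetical special tokens; the source does not list the actual set.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_parallel_vocab(
    pivot_words: List[str],
    translate: Dict[str, str],
    mono_vocab: List[str],
    max_size: int,
) -> Dict[str, int]:
    """Build one language's vocabulary so that aligned words share pivot indices."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)

    # Aligned block: slot i always corresponds to pivot_words[i].
    for i, word in enumerate(pivot_words):
        target = translate.get(word)
        if target is None or target in seen:
            # Assumption: reserve the slot with a placeholder so later
            # indices stay aligned across languages.
            target = f"[UNUSED_{i}]"
        vocab.append(target)
        seen.add(target)

    # Expansion block: language-specific tokens appended until the cap.
    for token in mono_vocab:
        if len(vocab) >= max_size:
            break
        if token not in seen:
            vocab.append(token)
            seen.add(token)

    return {token: index for index, token in enumerate(vocab[:max_size])}


pivot = ["i", "eat", "rice"]
english = build_parallel_vocab(pivot, {w: w for w in pivot}, ["##ing"], 16)
hausa = build_parallel_vocab(
    pivot, {"i": "ina", "eat": "cin", "rice": "shinkafa"}, ["##wa"], 16
)

# Semantically equivalent words land on the same index in both vocabularies.
shared = [(english[e], hausa[h]) for e, h in zip(pivot, ["ina", "cin", "shinkafa"])]
```

Because each aligned slot is tied to a position in the pivot list, equivalent words in any two languages built this way resolve to the same embedding row, while the expansion block keeps room for language-specific tokens.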

Abstract

Tokenization defines the foundation of multilingual language models by determining how words are represented and shared across languages. However, existing methods often fail to support effective cross-lingual transfer because semantically equivalent words are assigned distinct embeddings. For example, "I eat rice" in English and "Ina cin shinkafa" in Hausa are typically mapped to different vocabulary indices, preventing shared representations and limiting cross-lingual generalization. We introduce parallel tokenizers, a framework that trains tokenizers monolingually and then aligns their vocabularies exhaustively using bilingual dictionaries or word-to-word translation, ensuring consistent indices for semantically equivalent words. This alignment enforces a shared semantic space across languages while naturally improving fertility balance. To assess their effectiveness, we pretrain a transformer encoder from scratch on thirteen low-resource languages and evaluate it on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity. Across all tasks, models trained with parallel tokenizers outperform conventional multilingual baselines, confirming that rethinking tokenization is essential for advancing multilingual representation learning, especially in low-resource settings.
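The "fertility" mentioned in the abstract is commonly measured as the average number of subword tokens produced per whitespace-separated word. The sketch below shows that calculation on the Hausa example sentence. It assumes a toy greedy longest-match tokenizer (`greedy_tokenize`) and made-up vocabularies purely for illustration; neither is taken from the paper.

```python
from typing import Callable, List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Toy longest-match segmentation; falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def fertility(sentences: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    word_count = 0
    token_count = 0
    for sentence in sentences:
        for word in sentence.split():
            word_count += 1
            token_count += len(tokenize(word))
    return token_count / word_count if word_count else 0.0


corpus = ["ina cin shinkafa"]

# Hypothetical vocabularies: one holds whole words, one only fragments.
whole_word_vocab = {"ina", "cin", "shinkafa"}
fragment_vocab = {"ina", "cin", "shin", "kafa"}

whole_word_fertility = fertility(corpus, lambda w: greedy_tokenize(w, whole_word_vocab))
fragment_fertility = fertility(corpus, lambda w: greedy_tokenize(w, fragment_vocab))
```

A vocabulary that covers a language's words well keeps fertility close to 1, whereas one dominated by other languages tends to split words into more pieces. Comparing this number across languages is one way to read the "fertility balance" claim.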

Paper Structure

This paper contains 32 sections, 1 equation, 7 figures, 25 tables.

Figures (7)

  • Figure 1: The overview of the parallel tokenizer. Tokens with equivalent meanings across languages are mapped to the same index and thus share the same embedding representation in the model.
  • Figure 2: Design of the parallel vocabularies (top) and the model input representation (bottom). The language token (e.g., [JV] in the example) is not used as an explicit input token; instead, it functions as a signal to select the corresponding language identity embedding during input representation.
  • Figure 3: Total number of [UNK] tokens produced when tokenizing the FLORES+ dataset for each language. Left: languages seen by mBERT (jav, min, sun, swa); Right: languages unseen by mBERT (ace, amh, ban, hau, ibo, kin, orm, tir, twi). Values are plotted on a log2 scale.
  • Figure 4: PCA visualization of the last hidden states from Single-13L (left) and Parallel-13L (right) models on the FLORES+ dataset.
  • Figure 5: Performance differences across benchmarks on both cross-lingual and monolingual finetuning data for models pretrained with different tokenizer setups.
  • ...and 2 more figures