Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

Haeji Jung, Jinju Kim, Kyungjin Kim, Youjeong Roh, David R. Mortensen

TL;DR

This study investigates transliteration as a means to bridge script barriers in multilingual NLP by isolating three factors: shared character set, shared token set, and shared phonology. It evaluates four input types—Ortho, IPA, Rom, and Cipher—via controlled pretraining of Transformer models across four language sets and two downstream tasks (NER and XNLI). Rom (Romanization) consistently yields the strongest improvements for unseen languages, driven by the production of longer shared tokens and higher vocabulary coverage, which better utilize model embeddings. The findings suggest that beyond surface script similarity, phonology-informed transliteration reconfigures token distributions to enhance cross-lingual transfer, with practical implications for designing multilingual systems that include underrepresented languages. Limitations include the use of a single model type and transliteration tools, calling for broader validation across architectures and transliteration pipelines.

Abstract

Transliteration has emerged as a promising means to bridge the gap between languages in multilingual NLP, with strong results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
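
To make the substitution-cipher input type concrete, here is a minimal sketch of a character-level substitution cipher. It is illustrative only: the helper names (`make_cipher`, `encipher`, `invert`), the choice of lowercase Latin letters as the alphabet, and the seeded random permutation are all assumptions, since this section does not describe how the paper's cipher is actually built. It shows the property that matters for such a control condition: the mapping is one-to-one, so word boundaries and lengths survive while the surface characters change.

```python
import random
import string


def make_cipher(alphabet: str, seed: int = 0) -> dict[str, str]:
    """Build a one-to-one character mapping by shuffling the alphabet.

    Hypothetical helper: the alphabet and seeding scheme are assumptions,
    not the construction used in the paper.
    """
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def encipher(text: str, table: dict[str, str]) -> str:
    """Replace each character found in the table; leave the rest unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table: dict[str, str]) -> dict[str, str]:
    """Reverse the mapping so that enciphered text can be decoded."""
    return {target: source for source, target in table.items()}


table = make_cipher(string.ascii_lowercase, seed=13)
plain = "sharing a vocabulary"
coded = encipher(plain, table)
decoded = encipher(coded, invert(table))
```

Because the table is a permutation, `decoded` equals `plain`, and spaces pass through untouched because they are not in the assumed alphabet.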

Paper Structure

This paper contains 39 sections, 5 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Top left: Conceptual visualization of the transliteration analysis schema, positioning input types (Ortho, IPA, Rom, Cipher) based on shared character set, token set, and phonology. Top right: KDE plot showing empirical distribution of overlap ratios for each quantifiable component. Bottom: Transliteration examples generated with each method.
  • Figure 2: (a) Negative correlation between UNK token ratio and F1 score. (b) Performance gains of transliterated input types compared to orthography-based models, where $s_t$ and $s_o$ denote scores with transliterated and orthographic inputs, respectively. Performance gains appear primarily in target languages whose original scripts are unseen during pre-training.
  • Figure 3: Unknown (UNK) token ratio for unseen languages across different input types.
  • Figure 4: Correlation between token overlap ratio by length and downstream performance. Correlations with $p > 0.05$ are masked.
  • Figure 5: Number of unique tokens by length for unseen target languages, using models trained on sim-div languages.
  • ...and 8 more figures