TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

Yihong Liu; Chunlan Ma; Haotian Ye; Hinrich Schütze

TL;DR

TransMI presents a training-free framework to adapt multilingual pretrained language models to transliterated data by transliterating the vocabulary, merging new transliterations with the original vocabulary, and initializing embeddings for newly added subwords. This enables strong baselines for cross-script transfer without retraining, while preserving performance on non-transliterated data. The method is validated on three strong mPLMs (XLM-R, Glot500, Furina) across sentence retrieval, text classification, and sequence labeling, showing consistent transliteration gains (3%–34%) with minimal non-transliterated degradation. Analysis reveals Max-Merge generally performs best, with script- and task-dependent nuances, and highlights TransMI’s potential to serve as a practical baseline and starting point for future transliteration-focused fine-tuning or continued pretraining.

Abstract

Transliterating related languages that use different scripts into a common script is effective for improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because it requires a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI). TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We apply TransMI to three strong recent mPLMs. Our experiments demonstrate that TransMI not only preserves the mPLM's ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating crosslingual transfer across scripts. The results show consistent improvements of 3% to 34% for different mPLMs and tasks. We make our code and models publicly available at https://github.com/cisnlp/TransMI.
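The three stages can be illustrated with a minimal sketch. It is not the authors' implementation: the character map `TOY_LATIN`, the function names, and the toy vocabulary are hypothetical, and the way each merge mode selects a score and an embedding (highest-scoring source subword for Max, lowest-scoring for Min, mean over all sources for Average) is an assumption made for illustration only. A real pipeline would use a proper transliteration tool and the mPLM's actual tokenizer and embedding matrix.

```python
from collections import defaultdict

# Hypothetical character map standing in for a real transliteration tool.
TOY_LATIN = {"т": "t", "а": "a", "н": "n", "е": "e", "τ": "t", "α": "a"}


def transliterate(subword):
    """Map each character to Latin script; unknown characters pass through."""
    return "".join(TOY_LATIN.get(ch, ch) for ch in subword)


def transmi(vocab, embeddings, mode="max"):
    """Sketch of Transliterate-Merge-Initialize.

    vocab: subword -> unigram score (log-probability).
    embeddings: subword -> list of floats.
    mode: "min", "max", or "average" (assumed semantics, see lead-in).
    """
    # Stage 1: transliterate the vocabulary and group source subwords
    # by their Latin form. Groups with more than one source are ambiguous.
    sources = defaultdict(list)
    for subword in vocab:
        latin = transliterate(subword)
        if latin != subword:
            sources[latin].append(subword)

    new_vocab = dict(vocab)
    new_emb = {k: list(v) for k, v in embeddings.items()}

    for latin, group in sources.items():
        if latin in vocab:
            continue  # already covered: leave the original entry untouched
        if mode == "average":
            # Stages 2 and 3: mean score and mean embedding over all sources.
            new_vocab[latin] = sum(vocab[s] for s in group) / len(group)
            dim = len(embeddings[group[0]])
            new_emb[latin] = [
                sum(embeddings[s][i] for s in group) / len(group)
                for i in range(dim)
            ]
        else:
            # Stages 2 and 3: copy score and embedding from one chosen source.
            pick = max if mode == "max" else min
            chosen = pick(group, key=lambda s: vocab[s])
            new_vocab[latin] = vocab[chosen]
            new_emb[latin] = list(embeddings[chosen])
    return new_vocab, new_emb


# Toy data: "та" (Cyrillic) and "τα" (Greek) both transliterate to "ta".
vocab = {"the": -1.0, "та": -2.0, "τα": -4.0, "net": -3.5, "нет": -3.0}
emb = {
    "the": [0.2, 0.2],
    "та": [1.0, 0.0],
    "τα": [0.0, 1.0],
    "net": [0.3, 0.3],
    "нет": [0.7, 0.7],
}
merged_vocab, merged_emb = transmi(vocab, emb, mode="max")
```

In this toy run the only new entry is `ta`. The transliteration of `нет` collides with the existing `net` entry, which is left unchanged so that behavior on non-transliterated input is preserved.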

Paper Structure (35 sections, 7 equations, 2 figures, 24 tables)

Introduction
Related Work
Transliteration for Multilingual NLP
Vocabulary and Tokenizer Manipulation
Preliminary: SentencePiece Unigram
Methodology
Tokenizer Vocabulary Transliteration
Merge New Vocabulary
Min-Merge Mode
Max-Merge Mode
Average-Merge Mode
Subword Embedding Initialization
Min-Merge Mode
Max-Merge Mode
Average-Merge Mode
...and 20 more sections

Figures (2)

Figure 1: Overview of TransMI. We transliterate all the subwords from the vocabulary of the source mPLM tokenizer into Latin script in Step 1. We then merge the filtered triplets (ambiguous transliterations) into the tokenizer vocabulary table using one of the three proposed modes in Step 2. Note that we perform direct merge operations for the rest of the triplets that are not ambiguous (not shown in the figure). Lastly in Step 3, we initialize the embeddings for the newly added subwords according to the merge mode used in the previous step.
Figure 2: Qualitative comparison between the original mPLMs and TransMI (Max-Merge mode) models (denoted with "-Trans") on transliterated evaluation datasets. We compute the average performance for each evaluation type (e.g., Sentence Retrieval is the average of SR-B and SR-T) for different script groups. Each language is placed into a group according to the script that the language is originally written in. The script groups are: Latn (Latin), Cyrl (Cyrillic), Hani (Han), Arab (Arabic), and Deva (Devanagari). Languages not written in these scripts are placed into Other. TransMI-modified models consistently outperform the original mPLMs across all tasks and groups.
