Unknown Script: Impact of Script on Cross-Lingual Transfer

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

TL;DR

This study investigates how the source language, script, and tokenizer of a pre-trained model influence cross-lingual transfer to a target language (Amharic) whose script (Fidel) the model has not seen during pre-training. Using a few-shot fine-tuning setup, the authors evaluate six models with diverse tokenizers on NER and POS tagging, in both Fidel and a romanized version of the text, to isolate the impact of tokenization. They find that RoBERTa-base transfers robustly across both scripts, that romanization strongly boosts performance for the subword-based models, and that Arabic-BERT offers no clear advantage. The results point to tokenizer choice as a stronger determinant of transfer than script similarity or language relatedness, with practical implications for adapting NLP systems to under-resourced languages.

Abstract

Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on cross-lingual transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained with different tokenization methods to determine the factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.

Paper Structure (17 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: We analyze the effect of script and tokenizer on cross-lingual transfer to a target language with a new script. We select six monolingual and multilingual models pre-trained using sub-word tokenizers and character tokenizers. We fine-tune these models on the NER and POS tasks in the original script (FIDEL) and the romanized version (Latin). We observe that RoBERTa has better cross-lingual transfer in both the original script and the romanized version. We also find that romanization is strongly beneficial in all cases of subword-based models (ALBERT, BERT, m-BERT). Additionally, fine-tuning Arabic-BERT, whose source language (Arabic) is typologically similar to our target language (Amharic), provides no advantage. We employ the base version of the models across all cases to ensure a fair comparison. The reported F1-score is averaged over five runs, with a standard deviation ranging between 0.003 and 0.009.
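
To make the script and tokenizer interaction described in the caption more concrete, here is a minimal Python sketch. It is not the authors' code and uses no real model: the tiny Latin-only vocabulary, the greedy WordPiece-style matcher, and the romanization `selam` are all assumptions chosen for illustration. It shows how a subword vocabulary that lacks Fidel characters collapses an Amharic word into an unknown token while the romanized form is still segmented, and how the same word always has a UTF-8 byte representation that a byte-level tokenizer could fall back on.

```python
# Toy illustration only: the vocabulary and the romanization below are
# hypothetical and are not taken from the paper or from any real model.

TOY_VOCAB = {"se", "##la", "##m", "s", "##e", "##l", "##a"}  # Latin-only pieces
UNK = "[UNK]"


def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first segmentation in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            span = word[start:end]
            candidate = span if start == 0 else "##" + span
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # any uncovered span makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces


def utf8_byte_count(word):
    """Number of UTF-8 bytes, the units a byte-level tokenizer can always represent."""
    return len(word.encode("utf-8"))


fidel = "ሰላም"        # Amharic word written in Fidel
romanized = "selam"   # one possible romanization (assumed for illustration)

print(wordpiece(fidel))            # ['[UNK]']
print(wordpiece(romanized))        # ['se', '##la', '##m']
print(utf8_byte_count(fidel))      # 9 (three bytes per Fidel character)
print(utf8_byte_count(romanized))  # 5
```

In this toy setting the Fidel input carries no usable subword signal, whereas the romanized input is split into known pieces. That matches the direction of the romanization result reported for the subword-based models, but the sketch does not reproduce any experiment from the paper.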