To Translate or Not to Translate: A Systematic Investigation of Translation-Based Cross-Lingual Transfer to Low-Resource Languages

Benedikt Ebing, Goran Glavaš

TL;DR

This work systematically evaluates translation-based cross-lingual transfer (XLT) for low-resource languages, comparing translate-train, translate-test, and round-trip translation (RTT) strategies across 40 languages. It introduces strategies that augment the training data with translations into high-resource languages, covers languages that the MT system does not support via closest-language transfer, and uses MT-derived validation data for model selection. Results show that translation-based XLT substantially outperforms zero-shot XLT, with RTT delivering the strongest gains on sequence-level tasks and translate-train variants excelling on token-level tasks; augmentation with high-resource languages generally helps RTT, and MT-informed validation improves checkpoint selection. The study argues for broader adoption of translation-based baselines in XLT research and provides practical guidelines for deploying XLT in truly low-resource settings.

Abstract

Perfect machine translation (MT) would render cross-lingual transfer (XLT) by means of multilingual language models (mLMs) superfluous. Given, on the one hand, the large body of work on improving XLT with mLMs and, on the other hand, recent advances in massively multilingual MT, we systematically evaluate existing translation-based XLT approaches and propose new ones for transfer to low-resource languages. We show that all translation-based approaches dramatically outperform zero-shot XLT with mLMs, and that combining round-trip translation of the source-language training data with translation of the target-language test instances at inference is generally the most effective. We next show that one can obtain further empirical gains by adding reliable translations into other high-resource languages to the training data. Moreover, we propose an effective translation-based XLT strategy even for languages not supported by the MT system. Finally, we show that model selection for XLT based on target-language validation data obtained with MT outperforms model selection based on the source-language data. We believe our findings warrant a broader inclusion of more robust translation-based baselines in XLT research.
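
To make the three strategies named in the abstract concrete, here is a minimal Python sketch of how each one prepares data for a sequence-level task. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `(text, label)` example format, the language codes, and the `toy_mt` stand-in for a real MT system are all hypothetical. Token-level tasks would additionally need the labels projected onto the translated tokens, which this sketch leaves out.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]                   # (text, label) for a sequence-level task
Translate = Callable[[str, str, str], str]  # (text, from_lang, to_lang) -> translated text


def translate_train(train: List[Example], mt: Translate, src: str, tgt: str) -> List[Example]:
    """Translate source-language training data into the target language.

    Labels are copied unchanged, which only works for sequence-level tasks.
    """
    return [(mt(text, src, tgt), label) for text, label in train]


def round_trip(train: List[Example], mt: Translate, src: str, pivot: str) -> List[Example]:
    """Translate training data src -> pivot -> src.

    The result is still in the source language but carries MT noise,
    similar to what translated test instances look like at inference.
    """
    return [(mt(mt(text, src, pivot), pivot, src), label) for text, label in train]


def translate_test(test_texts: List[str], mt: Translate, src: str, tgt: str) -> List[str]:
    """Translate target-language test instances into the source language at inference."""
    return [mt(text, tgt, src) for text in test_texts]


def toy_mt(text: str, from_lang: str, to_lang: str) -> str:
    """Hypothetical stand-in for an MT system: it only tags the translation direction."""
    return f"[{from_lang}->{to_lang}] {text}"


train = [("good movie", 1), ("bad movie", 0)]
tt_train = translate_train(train, toy_mt, "eng", "swa")
rtt_train = round_trip(train, toy_mt, "eng", "swa")
test_in_src = translate_test(["filamu nzuri"], toy_mt, "eng", "swa")
```

In this sketch, the combination the abstract reports as generally most effective corresponds to training on `rtt_train` and predicting on `test_in_src`. The same `translate_train` helper could also produce target-language validation data for checkpoint selection.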

Paper Structure (17 sections, 1 figure, 18 tables)

Figures (1)

  • Figure 1: Schematic overview of translation-based XLT methods. Clean source or target language data is indicated in black, while noisy translated data is shown in orange.