Beyond Vanilla Fine-Tuning: Leveraging Multistage, Multilingual, and Domain-Specific Methods for Low-Resource Machine Translation

Sarubi Thillainathan, Songchen Yuan, En-Shiun Annie Lee, Sanath Jayasena, Surangika Ranathunga

TL;DR

This work addresses low-resource, domain-specific neural machine translation (NMT) for non-English languages using multilingual sequence-to-sequence large language models (msLLMs). It proposes two strategies, Continual Pre-training (CPT) and Intermediate Task Transfer Learning (ITTL), and evaluates them on Sinhala, Tamil, and English across six translation directions, achieving a mean BLEU improvement of $+1.47$ over single-stage fine-tuning, with further gains from ensembling reaching $+2.13$ BLEU on average. The authors use CPT to incorporate domain-relevant monolingual data and ITTL to leverage in-domain and out-of-domain parallel data, exploring multiple three-stage and multilingual-to-bilingual fine-tuning pipelines. The findings show that even small amounts of in-domain monolingual data can significantly improve performance, especially when combined with ITTL and ensembling, which matters in practice for deploying NMT in resource-constrained domains and languages. The techniques are broadly applicable to msLLMs beyond the models studied, offering a path toward more robust, domain-aware multilingual NMT for diverse languages.

Abstract

Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise in developing neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning methods struggle in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), where the msLLM is further trained with domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), a method that fine-tunes the msLLM with both in-domain and out-of-domain parallel data to enhance its translation capabilities across various domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets containing fewer than 100,000 samples). Our experiments reveal that these approaches improve translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) points over the standard single-stage fine-tuning baseline across all translation directions. Additionally, a multi-model ensemble further improves the BLEU score.
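
The staging described in the abstract can be pictured as an ordered list of training stages. The following is only a minimal sketch of that idea and not the authors' implementation: the names `Stage` and `build_pipeline` are hypothetical, and the particular ordering (CPT on in-domain monolingual data, then an intermediate ITTL stage on out-of-domain parallel data, then a final fine-tuning stage on in-domain parallel data) is an assumption about one possible three-stage variant.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One training stage in a hypothetical multistage plan."""
    name: str       # "CPT", "ITTL", or "FT" (final fine-tuning)
    data_kind: str  # "monolingual" or "parallel"
    domain: str     # "in-domain" or "out-of-domain"


def build_pipeline(use_cpt: bool, use_ittl: bool) -> List[Stage]:
    """Return the ordered stages for one assumed pipeline variant.

    With both flags off, this reduces to the single-stage
    fine-tuning baseline mentioned in the abstract.
    """
    stages: List[Stage] = []
    if use_cpt:
        # Assumption: CPT runs first, on domain-specific monolingual text.
        stages.append(Stage("CPT", "monolingual", "in-domain"))
    if use_ittl:
        # Assumption: the intermediate task uses out-of-domain parallel data.
        stages.append(Stage("ITTL", "parallel", "out-of-domain"))
    # The final stage always fine-tunes on in-domain parallel data.
    stages.append(Stage("FT", "parallel", "in-domain"))
    return stages


if __name__ == "__main__":
    for stage in build_pipeline(use_cpt=True, use_ittl=True):
        print(f"{stage.name}: {stage.domain} {stage.data_kind} data")
```

Representing the plan as data rather than hard-coding a training loop keeps the baseline and the multistage variants comparable: each variant is simply a different list passed to the same (unspecified here) training routine.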

Paper Structure

This paper contains 30 sections, 1 equation, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of Methodology
  • Figure 2: Overview of ITTL
  • Figure 3: Output translations