Towards Inducing Long-Context Abilities in Multilingual Neural Machine Translation Models

Varun Gumma, Pranjal A. Chitale, Kalika Bali

TL;DR

This paper induces long-context abilities in multilingual neural machine translation by replacing, post hoc, the absolute sinusoidal positional embeddings of pre-trained NMT models with relative ones (RoPE or ALiBi). It shows that parameter-efficient fine-tuning can recover or exceed the original performance, that RoPE gives the strongest gains on document-level translation, and that cross-lingual length generalization is achievable from minimal long-context data. The work introduces a high-quality training corpus mix, document-level evaluation, and GPT-4o–based qualitative analysis, and shows that the results scale to a larger 1B IndicTrans2 model. An open-source framework and HuggingFace releases accompany the findings, offering a practical path to deploying long-context multilingual MT without full retraining. The approach may extend to other encoder-decoder architectures, enabling broader long-context capabilities in NLP systems.

Abstract

Neural Machine Translation (NMT) models have traditionally used Sinusoidal Positional Embeddings (PEs), which often struggle to capture long-range dependencies and are inefficient for handling extended context or document-level translation tasks. This work addresses the challenge of transitioning pre-trained NMT models from absolute Sinusoidal PEs to Relative PEs, such as RoPE and ALiBi, without compromising performance. We demonstrate that parameter-efficient fine-tuning, using only a small amount of high-quality data, can successfully facilitate this transition. Experimental results indicate that switching from Sinusoidal to Relative PEs results in competitive translation quality on sentence-level evaluation benchmarks. Additionally, models trained with RoPE consistently outperform those using ALiBi and Sinusoidal PEs on document-level benchmarks across both string-based metrics and qualitative evaluations. Moreover, we find that a small amount of long-context data in a few languages is sufficient for cross-lingual length generalization, thereby inducing long-context capabilities.
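
To make the distinction between absolute and relative PEs concrete, the following is a minimal NumPy sketch of the two relative schemes named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names (rope_rotate, relative_score, alibi_bias), the "rotate-half" pairing of dimensions, the base of 10000, and the symmetric |i - j| distance used for the ALiBi bias are all choices made here for clarity. The original ALiBi formulation is causal, and the slope formula shown applies when the number of heads is a power of two.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq_len, dim) by position-dependent angles.

    Assumes dim is even and uses the "rotate-half" pairing (i, i + dim/2).
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * inv_freq[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def relative_score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    q_rot = rope_rotate(q, np.array([m]))[0]
    k_rot = rope_rotate(k, np.array([n]))[0]
    return float(q_rot @ k_rot)


def alibi_bias(seq_len, n_heads):
    """Per-head linear distance penalty added to attention logits.

    Assumption: symmetric |i - j| distance, for illustration only.
    Returns an array of shape (n_heads, seq_len, seq_len).
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -slopes[:, None, None] * dist[None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((1, 8))
    k = rng.standard_normal((1, 8))
    # With RoPE, the logit depends only on the offset n - m, not on m or n alone.
    print(relative_score(q, k, 3, 7), relative_score(q, k, 103, 107))
    print(alibi_bias(seq_len=4, n_heads=2).shape)
```

The two printed RoPE scores agree up to floating-point error because both pairs of positions share the same offset. This shift invariance is the property that lets relative schemes address sequence lengths beyond those seen in training, whereas an absolute sinusoidal embedding ties each token to its absolute index.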

Paper Structure (32 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Distribution of source token lengths for the Indic side of conversations in the IN22-Conv Doc-Level test set tokenized using the IndicTrans2 tokenizer. The red line is the average across all languages.
  • Figure 2: ChrF++ scores on Sentence-Level (top) and Document-Level (bottom) benchmarks. Results are presented for baselines using three fine-tuning setups (FFT, LoRA, and min-LoRA) that compare three types of positional embeddings (Sine, ALiBi, and RoPE), along with the pre-trained model performance baseline.
  • Figure 3: Implicit MQM scores and normalized formula-based MQM scores on the IN22-Conv benchmark, averaged across the top-12 languages highlighted in the paper's list-of-languages table.
  • Figure 4: Throughput in tokens/sec. Higher is better.