How effective is Multi-source pivoting for Translation of Low Resource Indian Languages?

Pranav Gaikwad, Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

TL;DR

This paper investigates translating English into low-resource Indian languages with multi-source pivoting, which uses both the English source sentence and pivot-language representations, typically from two pivot languages. It evaluates several architectural variants (LAM, cross-attention schemes, regularization) and data augmentation strategies (pivot-synthetic and target-synthetic) for Konkani, Manipuri, Sanskrit, and Bodo, with Hindi, Marathi, and Bengali as pivots. The main finding is that multi-source pivoting yields only marginal improvements over strong baselines, though synthetic data can add further gains; the results challenge prior claims and suggest that the benefit of pivoting depends on data quantity and training setup. The work presents multi-source pivoting as a promising direction for low-resource MT and offers guidance for future work on pivoting and synthetic data.

Abstract

Machine Translation (MT) between linguistically dissimilar languages is challenging, especially due to the scarcity of parallel corpora. Prior works suggest that pivoting through a high-resource language can help translation into a related low-resource language. However, existing works tend to discard the source sentence when pivoting. Taking the case of English to Indian language MT, this paper explores the 'multi-source translation' approach with pivoting, using both source and pivot sentences to improve translation. We conduct extensive experiments with various multi-source techniques for translating English to Konkani, Manipuri, Sanskrit, and Bodo, using Hindi, Marathi, and Bengali as pivot languages. We find that multi-source pivoting yields marginal improvements over the state-of-the-art, contrary to previous claims, but these improvements can be enhanced with synthetic target language data. We believe multi-source pivoting is a promising direction for low-resource translation.

Paper Structure (20 sections, 5 figures, 4 tables)

This paper contains 20 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The modifications made to the transformer architecture for each experiment, mapping each experiment described in the Methodology section to the specific architectural changes implemented.
  • Figure 2: An example translation from English to Konkani. Translation_1 and Gloss_1 show the translation produced by IndicTransV2; Translation_2 and Gloss_2 show the translation produced by our system (2E-1D); Reference and Gloss belong to the reference sentence.
  • Figure 3: An example translation from English to Bodo. Translation_1 and Gloss_1 show the translation produced by IndicTransV2; Translation_2 and Gloss_2 show the translation produced by our system (2E-1D); Reference and Gloss belong to the reference sentence.
  • Figure 4: Effect of increasing Gaussian noise during training
  • Figure 5: The weights learned by the Logits Aggregation Module (LAM) after training. In most cases, the source-side and pivot-side logits receive equal weight; the exception is Hindi as a pivot for Sanskrit.