CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages

Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, Anoop Kunchukuttan

TL;DR

CharSpan tackles machine translation from extremely low-resource languages (ELRLs) to English by exploiting lexical similarity to related high-resource languages (HRLs). It introduces character-span noise augmentation (CharSpan) to regularize HRL training, enabling robust zero-shot cross-lingual transfer to ELRLs. Across six HRLs and twelve ELRLs in Indo-Aryan, Romance, and Malay-Polynesian families, CharSpan achieves state-of-the-art performance, with substantial chrF gains and improved linguistic quality in zero-shot generations. The approach requires no ELRL monolingual or parallel data and demonstrates strong generalization, though it assumes script similarity; future work includes applying CharSpan to other tasks, integrating with pre-trained models, and addressing English-to-ELRL MT.

Abstract

We address the task of machine translation (MT) from extremely low-resource languages (ELRLs) to English by leveraging cross-lingual transfer from a 'closely-related' high-resource language (HRL). Developing an MT system for an ELRL is challenging because these languages typically lack both parallel and monolingual corpora, and their representations are absent from large multilingual language models. Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity. However, existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align the HRL and ELRL latent embedding spaces. To overcome this limitation, we propose a novel approach, CharSpan, which applies 'character-span noise augmentation' to the HRL training data. This serves as a regularization technique, making the model more robust to 'lexical divergences' between the HRL and ELRL, thus facilitating effective cross-lingual transfer. Our method significantly outperforms strong baselines in zero-shot settings on closely related HRL and ELRL pairs from three diverse language families, emerging as the state-of-the-art model for ELRLs.
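
To make the idea of character-span noise concrete, the following is a minimal sketch of how HRL source sentences might be perturbed before training. It is not the authors' implementation: the function name char_span_noise, the parameters start_prob and max_span, the 50/50 choice between replacing and deleting a span, and the Latin candidate alphabet are all illustrative assumptions, since the text above does not specify them.

```python
import random

# Assumed candidate alphabet for Latin-script languages (illustrative only).
LATIN_ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def char_span_noise(sentence, alphabet=LATIN_ALPHABET, start_prob=0.05,
                    max_span=3, rng=None):
    """Return a noisy copy of `sentence`.

    At each non-whitespace position a span starts with probability
    `start_prob` (assumed value). The span covers 1..max_span characters,
    clipped at the next whitespace, and is either replaced by one random
    character from `alphabet` or deleted.
    """
    if rng is None:
        rng = random.Random()
    out = []
    i, n = 0, len(sentence)
    while i < n:
        ch = sentence[i]
        # Whitespace is always copied, so word boundaries are preserved.
        if ch.isspace() or rng.random() >= start_prob:
            out.append(ch)
            i += 1
            continue
        # Pick a span length and clip it at whitespace or end of sentence.
        span = rng.randint(1, max_span)
        end = i
        while end < n and end - i < span and not sentence[end].isspace():
            end += 1
        if rng.random() < 0.5:
            # Replace the whole span with a single random character.
            out.append(rng.choice(alphabet))
        # Otherwise the span is deleted (nothing is appended).
        i = end
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(13)
    print(char_span_noise("the quick brown fox jumps over the lazy dog",
                          start_prob=0.2, rng=rng))
```

In this sketch only the source (HRL) side would be perturbed while the English target stays unchanged, so the model sees many slightly corrupted spellings of the same sentence; that is the regularization effect the abstract describes.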

Paper Structure (21 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: Hindi (HIN; HRL), Bhojpuri (BHO; ELRL) and Chhattisgarhi (HNE; ELRL) parallel sentences. Additionally, the corresponding noisy Hindi example with character-span noise. BHO and HNE are closely related to HIN.
  • Figure 2: Overview of proposed CharSpan model
  • Figure 3: Candidate alphabets for noise augmentation. For the Indo-Aryan language family, the Devanagari alphabet is used, while the Latin alphabet is employed for the Romance and Malay-Polynesian language families.
  • Figure 4: Lexical similarity (LCSR) heatmaps for three language families. The Indo-Aryan languages are considered to use the Devanagari script, while the Latin script is used by the other two language families.
  • Figure 5: Lexical similarity heatmap for additional languages used in the analysis section. Similarity scores are shown for Assamese (asm), Bengali (ben), Gujarati (guj), Panjabi (pan), Hindi (hin), Marathi (mar), Oriya (ory), Malayalam (mal), Kannada (kan), Tamil (tam) and Telugu (tel).
  • ...and 2 more figures
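
The lexical similarity in Figures 4 and 5 is reported as LCSR. Assuming this is the longest common subsequence ratio in its usual form (the LCS length of two strings divided by the length of the longer string), a minimal sketch for one pair of strings looks as follows. The function names are hypothetical, and how the paper aggregates scores over a corpus is not stated here, so this covers only the pairwise score.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio in [0, 1]; 1.0 for two empty strings."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # "night" and "nacht" share the subsequence "nht": 3 / 5
    print(lcsr("night", "nacht"))
```

Because the score is computed over characters, it is only meaningful when both strings are in the same script, which matches the script assumption noted in the figure captions.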