Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus

Pooja Singh, Shashwat Bhardwaj, Vaibhav Sharma, Sandeep Kumar

TL;DR

This work introduces the Bhili-Hindi-English Parallel Corpus (BHEPC), the first large-scale, community-curated trilingual parallel resource for Bhili, with 110,000 sentences across the education, administration, and mass media domains. It benchmarks a wide array of open-source and proprietary multilingual models on translation with Bhili as either source or target, finding that the NLLB-200 distilled 600M model generally performs best after fine-tuning, while in-context learning gives competitive results for larger models. The authors quantify cross-domain generalization using Jensen–Shannon Divergence and show that the similarity between fine-tuning and test domains strongly influences translation quality, which motivates domain-aware adaptation strategies. They further validate automatic metrics against human MQM judgments and perform a qualitative error analysis that reveals challenges such as language mixing, hallucination, and domain-specific terminology. The paper concludes with a scalable, hybrid workflow (seed data, model-assisted generation, post-editing) for building resources in low-resource languages, and discusses ethical considerations and future directions for broadening digital inclusion for Bhili and similar languages.

Abstract

The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing the Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low-resource machine translation. To establish a comprehensive Bhili machine translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. The evaluation shows that the fine-tuned distilled 600M variant of NLLB-200 outperforms the other models, highlighting the potential of multilingual models in low-resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.

Paper Structure

This paper contains 37 sections, 1 equation, 11 figures, 14 tables.

Figures (11)

  • Figure 1: chrF++ performance trends of LLMs with 0, 5, and 10 in-context examples (shots) for four translation directions: Hindi$\leftrightarrow$Bhili and English$\leftrightarrow$Bhili.
  • Figure 2: Radar plot showing chrF++ scores of fine-tuned LLMs across four translation directions. NLLB-200 (600M) excels in low-resource scenarios, while other models exhibit balanced performance.
  • Figure 3: Jensen-Shannon Divergence (JSD) heatmap for cross-domain generalization evaluation. JSD is computed between in-domain and cross-domain data to quantify distributional divergence between fine-tuning and testing corpora across four translation directions. The results demonstrate that domain shifts significantly impact translation performance, affecting model generalization.
  • Figure 4: Bar plots showing spBLEU scores across four translation directions: (1) hin$\rightarrow$bhb, (2) bhb$\rightarrow$hin, (3) eng$\rightarrow$bhb, and (4) bhb$\rightarrow$eng. The NLLB model is fine-tuned on domain-specific datasets (NCERT, Govt/PMI, and Mass Media) and evaluated on both in-domain and cross-domain data. Each bar represents the translation quality achieved for a given direction and training corpus.
  • Figure 5: Plot showing the relationship between JSD and spBLEU scores for the NLLB model across three domains. Data points are color-coded by domain, with regression lines and confidence intervals highlighting domain-specific trends. NCERT shows little correlation, while Govt/PMI and Mass Media exhibit slight positive correlations, suggesting a trade-off between JSD and spBLEU scores.
  • ...and 6 more figures
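
The Jensen-Shannon Divergence referenced in the captions for Figures 3 and 5 can be illustrated with a minimal sketch. The version below compares whitespace-token unigram distributions of two corpora. The tokenization, the choice of unigrams, the base-2 logarithm, and the function names `unigram_distribution` and `jensen_shannon_divergence` are assumptions made for illustration; the summary above does not state how the authors estimated the distributions, so this should not be read as their exact procedure.

```python
import math
from collections import Counter


def unigram_distribution(sentences):
    """Relative frequency of each whitespace-separated token (assumed tokenization)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = 0.5 * (P + Q).

    With log base 2 the value lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with no shared tokens.
    """
    jsd = 0.0
    for tok in set(p) | set(q):
        pi = p.get(tok, 0.0)
        qi = q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)
        # Terms with zero probability contribute nothing to the KL sums.
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Toy placeholder sentences, not taken from BHEPC.
fine_tuning_domain = ["the school opens today", "the teacher reads a book"]
testing_domain = ["the office opens today", "the minister reads a report"]

p = unigram_distribution(fine_tuning_domain)
q = unigram_distribution(testing_domain)
divergence = jensen_shannon_divergence(p, q)
print(f"JSD between the two toy domains: {divergence:.3f}")
```

A lower value indicates that the fine-tuning and testing corpora share more of their vocabulary mass, which is the sense of domain similarity the captions relate to translation quality.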