Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Rahul Raja, Arpita Vats

TL;DR

This survey analyzes parallel corpora for low-resource Indic languages, spanning text-to-text, multimodal, speech, and code-switched data to support MT and cross-lingual NLP. It covers large-scale resources (e.g., BPCC, Samanantar, CCAligned) as well as low-coverage datasets, discusses evaluation metrics (BLEU, METEOR, COMET, TER) and gold-standard benchmarks (FLORES, NLLB), and highlights domain bias and script/orthography challenges. The paper identifies open issues in data imbalance, domain coverage, and the noisiness of web-mined data, and it outlines future directions including cross-lingual transfer, multimodal integration, crowd-sourced data, and automatic data generation. Its insights are intended to guide researchers and practitioners toward more representative, domain-relevant, and accessible parallel resources for Indic MT, with implications for language preservation and multilingual AI accessibility.

Abstract

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora along dimensions such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

Paper Structure

This paper contains 28 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of Machine Translation Model.
  • Figure 2: Challenges in Indic Machine Translation: Key issues include morphosyntactic complexity, script variations, low-resource languages, and translation errors in Hinglish and Benglish.
  • Figure 3: Indic Parallel Corpora: Data Size Distribution Across Languages. A stacked bar chart showing the number of sentence pairs (in millions) for various Indic-English language pairs across major corpora. Languages like Hindi, Tamil, and Bengali are well-resourced, while others such as Assamese and Odia have significantly less data. This highlights the data imbalance in Indic NLP and the need for better resource coverage.