PMIndia -- A Collection of Parallel Corpora of Languages of India
Barry Haddow, Faheem Kirefu
TL;DR
PMIndia provides a publicly accessible parallel corpus pairing English with 13 languages of India, drawn from the Indian Prime Minister's website, enabling multilingual NLP and MT research in under-resourced languages. The authors compare HunAlign and VecAlign for sentence alignment, validate alignment quality with KEOPS and human evaluation, and report initial NMT results that highlight challenges in translating into Dravidian languages. The work demonstrates the feasibility of large-scale, cross-language alignment from government-domain content and discusses limitations and future directions toward a multi-parallel, multilingual corpus to boost cross-language transfer in South Asian languages.
Abstract
Parallel text is required for building high-quality machine translation (MT) systems, as well as for other multilingual NLP applications. For many South Asian languages, such data is in short supply. In this paper, we describe a new publicly available corpus (PMIndia) consisting of parallel sentences that pair 13 major languages of India with English. The corpus includes up to 56,000 sentences for each language pair. We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
