BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages

Vandan Mujadia, Dipti Misra Sharma

TL;DR

BhashaVerse proposes a scalable translation ecosystem for 36 Indian languages. It targets translation across all $36 \times 36$ language pairs by constructing over 1 billion parallel sentence pairs and training a 2-billion-parameter multilingual, multi-task encoder–decoder. The work combines corpus creation (including paragraph-level back-translation, filtering, and COMET-QE scoring), domain-specific data, and synthetic augmentation with evaluation across MT, grammar correction, post-editing, QA-based quality estimation, and error-span identification. It introduces a unified JSON-based multi-task learning framework and a transformer architecture optimized for resource-constrained settings, alongside language-script taxonomies and a 48k-token Indian-language tokenizer. Empirical results show competitive MT performance and robust cross-task transfer, with strong gains in grammar correction and post-editing when the tasks are trained jointly, underscoring the practical potential for multilingual communication in India’s diverse domains (education, healthcare, law). The work’s large-scale corpora and evaluation suites provide a resource for future research and for deploying high-quality, domain-aware translation tools across all 36 languages.

Abstract

This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu. Achieving this requires parallel and other types of corpora for all 36 * 36 language pairs, addressing challenges like script variations, phonetic differences, and syntactic diversity. For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality. To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India's linguistically diverse ecosystem.
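
To make the scale of the 36 * 36 target concrete, the short Python sketch below enumerates ordered (source, target) pairs. It is an illustration only: the function names `directed_pairs` and `pair_count` and the three-language sample are assumptions introduced here, not part of the paper, and the abstract does not say whether same-language pairs are counted, so both conventions are shown.

```python
from itertools import product


def directed_pairs(languages, include_identity=True):
    """Return every ordered (source, target) pair over `languages`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return [
        (src, tgt)
        for src, tgt in product(languages, repeat=2)
        if include_identity or src != tgt
    ]


def pair_count(n, include_identity=True):
    """Number of ordered pairs over n languages."""
    return n * n if include_identity else n * (n - 1)


# Small sample drawn from the language list in the abstract.
sample = ["Hindi", "Tamil", "Khasi"]
for src, tgt in directed_pairs(sample, include_identity=False):
    print(f"{src} -> {tgt}")

# Pair counts for the full set of 36 languages under both conventions.
print(pair_count(36))                          # counting same-language pairs
print(pair_count(36, include_identity=False))  # translation directions only
```

With 36 languages this gives 1296 ordered pairs, of which 1260 are actual translation directions between distinct languages, which is why direct parallel data cannot realistically be collected for every pair and synthetic augmentation becomes necessary for the low-resource ones.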

Paper Structure

This paper contains 31 sections, 10 tables.