Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

Fatema Tuj Johora Faria, Mukaffi Bin Moin, Ahmed Al Wase, Mehidi Ahmmed, Md. Rabius Sani, Tashreef Muhammad

TL;DR

Vashantor introduces a large-scale benchmark for translating Bangla regional dialects into standard Bangla and for identifying the regional origin of dialectal text. It presents two novel models, DialectBanglaT5 for dialect-to-Bangla translation and DialectBanglaBERT for region detection, achieving strong performance across five dialects (Chittagong, Noakhali, Sylhet, Barishal, Mymensingh) and establishing a new benchmark for low-resource, dialect-rich Bangla NLP. The dataset comprises 32,500 sentences across Bangla, Banglish, and English, sourced from diverse formal and informal domains, with rigorous translation guidelines, quality control, and region-specific annotation. Empirical results show DialectBanglaT5 outperforming mT5 and BanglaT5 on translation metrics (e.g., BLEU up to 71.93, METEOR up to 0.8503) and DialectBanglaBERT achieving 89.02% accuracy in region detection, underscoring the value of dialect-aware modeling. The work also discusses deployment considerations, ethics, and future directions, highlighting practical impact for inclusive, region-aware Bangla NLP systems in low-resource settings.

Abstract

The Bangla linguistic variety is a fascinating mix of regional dialects that contributes to the cultural diversity of the Bangla-speaking community. Despite extensive study into translating Bangla to English, English to Bangla, and Banglish to Bangla in the past, there has been a noticeable gap in translating Bangla regional dialects into standard Bangla. In this study, we set out to fill this gap by creating a collection of 32,500 sentences, encompassing Bangla, Banglish, and English, representing five regional Bangla dialects. Our aim is to translate these regional dialects into standard Bangla and detect regions accurately. To tackle the translation and region detection tasks, we propose two novel models: DialectBanglaT5 for translating regional dialects into standard Bangla and DialectBanglaBERT for identifying the dialect's region of origin. DialectBanglaT5 demonstrates superior performance across all dialects, achieving the highest BLEU score of 71.93, METEOR of 0.8503, and the lowest WER of 0.1470 and CER of 0.0791 on the Mymensingh dialect. It also achieves strong ROUGE scores across all dialects, indicating both accuracy and fluency in capturing dialectal nuances. In parallel, DialectBanglaBERT achieves an overall region classification accuracy of 89.02%, with notable F1-scores of 0.9241 for Chittagong and 0.8736 for Mymensingh, confirming its effectiveness in handling regional linguistic variation. This is the first large-scale investigation focused on Bangla regional dialect translation and region detection. Our proposed models highlight the potential of dialect-specific modeling and set a new benchmark for future research in low-resource and dialect-rich language settings.
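For readers unfamiliar with the error-rate metrics quoted above, the following is a minimal sketch of how WER and CER are conventionally defined: the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted in words for WER and in characters for CER. The abstract does not say which tooling the authors used, so the function names, the whitespace tokenization, and the choice to count Unicode code points (including spaces) as characters are all assumptions made for illustration. The example strings are toy English sentences, not samples from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes whitespace tokenization (hypothetical choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points, spaces included."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.4f}")
    print(f"CER = {cer(reference, hypothesis):.4f}")
```

Lower is better for both metrics, which is why the abstract reports the lowest WER and CER alongside the highest BLEU and METEOR. For Bangla text, counting code points rather than grapheme clusters treats combining vowel signs as separate characters, so a different segmentation would give different CER values.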

Paper Structure

This paper contains 50 sections, 10 equations, 14 figures, 15 tables, 2 algorithms.

Figures (14)

  • Figure 1: Detailed Workflow Illustrating the Construction Process of the Vashantor Dialectal Dataset
  • Figure 2: Core Data Information
  • Figure 3: Population Distribution Across Vashantor Dataset Regions
  • Figure 4: Representative Samples from the Vashantor Dataset
  • Figure 5: Overview of the Proposed Methodology for Bangla Dialect Translation and Region Detection. The diagram presents a two-stage framework: DialectBanglaT5 translates dialectal inputs into standard Bangla, while DialectBanglaBERT identifies the dialect's regional origin.
  • ...and 9 more figures