ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor

TL;DR

ANCHOLIK-NER tackles the underexplored problem of NER in Bangla regional dialects by constructing the first benchmark dataset covering Sylhet, Chittagong, Barishal, Noakhali, and Mymensingh. The dataset combines public corpora with manual translations, aligns entity annotations across dialects, and is built through a preprocessing and annotation pipeline that uses BIO tagging. Benchmarking three transformer models (Bangla BERT, Bangla BERT Base, and BERT Base Multilingual Cased) shows strong per-region performance, with F1 of up to about $0.826$, but also regional gaps, most notably in Chittagong. The work establishes a foundational resource for dialect-aware Bangla NER and highlights the need for broader dialect coverage, data augmentation, and dialect-specific modeling to improve generalization and inclusivity in Bangla NLP.

Abstract

Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences distributed across five regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models on this dataset: Bangla BERT, Bangla BERT Base, and BERT Base Multilingual Cased. Our findings show that BERT Base Multilingual Cased performs best at recognizing named entities across regions, with its strongest result in Mymensingh, an F1-score of 82.611%. Despite strong overall performance, challenges remain in regions such as Chittagong, where the models show lower precision and recall. Since no previous NER systems for Bangla regional dialects exist, our work represents a foundational step in addressing this gap. Future work will focus on improving model performance in underperforming regions and expanding the dataset to include more dialects, supporting the development of dialect-aware NER systems.
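To make the reported precision, recall, and F1 figures concrete, the following minimal Python sketch shows one common way such scores are computed from BIO-tagged sequences (the tagging scheme named in the TL;DR). It is an illustration under stated assumptions, not the authors' evaluation code: the exact-match, entity-level scoring, the function names `bio_to_spans` and `entity_prf`, and the `PER`/`LOC` labels are all hypothetical, and the paper may score at the token level or with a different label set.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); hypothetical helper type
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # Assumption: an "I-" tag with no matching open span is ignored.
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over sentences."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g_spans, p_spans = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g_spans & p_spans)   # spans predicted with correct bounds and type
        fp += len(p_spans - g_spans)   # predicted spans not in the gold annotation
        fn += len(g_spans - p_spans)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example with made-up labels; not data from ANCHOLIK-NER.
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(entity_prf(gold, pred))
```

In the toy example the prediction finds one of two gold entities and adds no spurious ones, so precision is 1.0 and recall is 0.5. Scoring each region's test split separately with a function like this would give per-dialect numbers comparable in kind to those quoted above.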

Paper Structure

This paper contains 25 sections, 4 equations, 16 figures, 11 tables, 2 algorithms.

Figures (16)

  • Figure 1: Regional NER examples along with Standard Bangla and English
  • Figure 2: Development of ANCHOLIK-NER: A Systematic Pipeline for Dataset Creation
  • Figure 3: Inter-Annotator Agreement (Cohen's Kappa) across different regions.
  • Figure 4: Average Tagging Speed (Time per 1000 tokens) by region in minutes.
  • Figure 5: Chittagong
  • ...and 11 more figures