ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages

Swastika Kundu, Autoshi Ibrahim, Mithila Rahman, Tanvir Ahmed

TL;DR

ANUBHUTI addresses the paucity of dialect-aware sentiment resources for Bangla by delivering a balanced, region-specific corpus of 10,000 sentences translated into four regional dialects. It combines dual annotations—multiclass thematic labels (Political/Religious/Neutral) and multilabel emotions (seven categories)—with rigorous quality assurance, including native translator involvement and inter-annotator agreement metrics. Cohen’s Kappa scores between $0.76$ and $0.84$ indicate substantial to near-perfect agreement, and the dataset is thoroughly validated for semantic fidelity, missing data, anomalies, and dialectal spelling. This resource enables dialect-sensitive NLP, with applications in chatbots, social monitoring, and mental health assessment in Bangladesh at a fine-grained regional level.

Abstract

Sentiment analysis for regional dialects of Bangla remains an underexplored area due to linguistic diversity and limited annotated data. This paper introduces ANUBHUTI, a comprehensive dataset consisting of 10,000 sentences manually translated from standard Bangla into four major regional dialects: Mymensingh, Noakhali, Sylhet, and Chittagong. The dataset predominantly features political and religious content, reflecting the contemporary sociopolitical landscape of Bangladesh, alongside neutral texts to maintain balance. Each sentence is annotated using a dual annotation scheme: multiclass thematic labeling categorizes sentences as Political, Religious, or Neutral, and multilabel emotion annotation assigns one or more emotions from Anger, Contempt, Disgust, Enjoyment, Fear, Sadness, and Surprise. Expert native translators conducted the translation and annotation, with quality assurance performed via Cohen's Kappa inter-annotator agreement, achieving strong consistency across dialects. The dataset was further refined through systematic checks for missing data, anomalies, and inconsistencies. ANUBHUTI fills a critical gap in resources for sentiment analysis in low-resource Bangla dialects, enabling more accurate and context-aware natural language processing.
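
For readers unfamiliar with the agreement statistic named above, Cohen's Kappa compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch of that computation for the thematic labels, not the authors' evaluation code. The function name `cohens_kappa` and the toy label lists are hypothetical and are not drawn from the ANUBHUTI data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each class, the product of the two annotators'
    # marginal label frequencies, summed over all classes.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    classes = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in classes) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations for eight sentences (illustration only).
annotator_1 = ["Political", "Political", "Religious", "Neutral",
               "Religious", "Neutral", "Political", "Neutral"]
annotator_2 = ["Political", "Religious", "Religious", "Neutral",
               "Religious", "Neutral", "Political", "Political"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 6 of 8 items ($p_o = 0.75$), chance agreement is $p_e = 21/64$, and $\kappa = 27/43 \approx 0.63$. How agreement was computed for the multilabel emotion annotations is not stated in this summary; one common option, offered here only as an assumption, is to compute a separate binary Kappa per emotion.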

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 4 tables, 2 algorithms.

Figures (3)

  • Figure 1: Systematic pipeline for the development of the ANUBHUTI dataset, illustrating data collection, preprocessing, and annotation stages.
  • Figure 2: Multiclass distribution of the ANUBHUTI dataset, showing the number of samples for each class label.
  • Figure 3: Multilabel distribution of the ANUBHUTI dataset, displaying the frequency of each label combination within the dataset.