BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

TL;DR

BnSentMix is a 20,000-sample Bengali-English code-mixed sentiment dataset drawn from YouTube, Facebook, and e-commerce platforms and labeled with four sentiments to reflect realistic mixed-language usage. The paper describes data collection, cleaning, a code-mixed detection pipeline, and annotation with high inter-annotator agreement, then benchmarks 14 baselines, including transformer models. Transformer-based approaches, especially English-pretrained BERT variants, achieve the top performance, which highlights the value of English signals in code-mixed Bengali sentiment tasks. The dataset is publicly available under CC BY 4.0, enabling future research on code-mixed Bengali, including tasks beyond sentiment analysis and refinements that address bias and dataset balance.

Abstract

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
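The two headline metrics in the abstract, accuracy and F1, can be illustrated with a minimal sketch for a four-label classification task. The label names and toy predictions below are hypothetical, and macro averaging is an assumption, since the abstract does not state which F1 averaging scheme the reported score uses.

```python
from collections import Counter

# Hypothetical label names: the abstract only states that there are 4 sentiment labels.
LABELS = ["positive", "negative", "neutral", "mixed"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-label F1 (macro averaging is an assumption here)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # gold label t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy data for illustration only, not taken from BnSentMix.
gold = ["positive", "negative", "neutral", "mixed", "positive", "negative"]
pred = ["positive", "negative", "mixed", "mixed", "neutral", "negative"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every label equally, so a label that is rare in the data affects the score as much as a frequent one. A weighted average would instead follow the label distribution.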

Paper Structure

This paper contains 24 sections, 5 figures, 4 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Examples of the four sentiment labels from our code-mixed Bengali-English dataset BnSentMix and the corresponding English translations. Red represents English words, blue represents Bengali words written in English alphabets, and cyan represents implicit words in the code-mixed text.
  • Figure 2: Dataset creation pipeline of the BnSentMix dataset.
  • Figure 3: Composition of data sources of the BnSentMix dataset.
  • Figure 4: Distribution of sentiment labels in the BnSentMix dataset.
  • Figure 5: Comparison of epoch-wise training loss of the established baselines.