MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

Kazi Samin Yasar Alam; Md Tanbir Chowdhury; Tamim Ahmed; Ajwad Abrar; Md Rafid Haque

TL;DR

MixSarc is the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. It provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.

Abstract

Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.

Paper Structure (55 sections, 1 equation, 4 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Code-Mixed Language Processing
Sentiment Analysis in Code-Mixed Text
Sarcasm and Humor Detection
Offensive and Vulgar Language Detection
MixSarc Dataset
Data Sourcing
Data Cleaning and Preprocessing
Emoji and Non-Textual Cue Removal
Script-Based Filtering
Code-Mixed Validation via mBERT
Data Annotation
Annotation Scheme
Annotators
...and 40 more sections

Figures (4)

Figure 1: Examples of Humorous, Sarcastic, Offensive, and Vulgar utterances from the MixSarc dataset. Each example is presented in its original Bangla–English code-mixed form followed by an English translation. Red marks English words; blue marks Bengali words written in the Latin alphabet.
Figure 2: Overview of the dataset preparation pipeline used in this work.
Figure 3: Zero-shot prompt used for LLM-based multi-label classification.
Figure 4: Distribution of sarcastic vs. genuine negative samples within the negative sentiment class of BnSentMix.
