Bangla Hate Speech Classification with Fine-tuned Transformer Models
Yalda Keivan Jafari, Krishno Dey
TL;DR
The paper addresses Bangla hate speech detection for Subtasks 1A and 1B in the BLP 2025 Shared Task by comparing classic baselines with fine-tuned transformer models (DistilBERT, BanglaBERT, m-BERT, XLM-RoBERTa). It finds that BanglaBERT, a language-specific pre-trained model, provides the strongest performance across both subtasks, outperforming multilingual counterparts and smaller baselines, highlighting the value of language-specific pretraining for Bangla. The study also discusses data downsizing, preprocessing choices, and dataset limitations, emphasizing the importance of Bangla-centric resources for robust hate-speech classification in low-resource settings. Overall, the work demonstrates the potential of transformer models for Bangla NLP while identifying key challenges such as noisy text, single-label annotations, and limited context that warrant further research.
Abstract
Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people in Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly under-represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and add Logistic Regression, Random Forest, and Decision Tree as further baseline methods. We also fine-tune transformer-based models (DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa) for hate speech classification. All transformer-based models except DistilBERT outperform the baseline methods on the subtasks. Among the transformer-based models, BanglaBERT performs best on both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests that language-specific pre-training is very important. Our results highlight the potential of, and the need for, pre-trained language models for the low-resource Bangla language.
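To make the Majority and Random baselines named in the abstract concrete, the following is a minimal sketch of how such baselines are commonly computed. It is not the shared task's official implementation: the function names, the toy label values ("none", "abusive", "profane"), and the use of plain accuracy as the score are all assumptions made for illustration.

```python
import random
from collections import Counter


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test item."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_test


def random_baseline(train_labels, n_test, seed=0):
    """Predict a label drawn uniformly from the labels seen in training."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    label_set = sorted(set(train_labels))
    return [rng.choice(label_set) for _ in range(n_test)]


def accuracy(gold, pred):
    """Fraction of test items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical toy labels, NOT taken from the shared-task data.
train_labels = ["none", "none", "none", "abusive", "profane", "none"]
test_labels = ["none", "abusive", "none", "none"]

majority_pred = majority_baseline(train_labels, len(test_labels))
random_pred = random_baseline(train_labels, len(test_labels))

print("majority accuracy:", accuracy(test_labels, majority_pred))  # 0.75
print("random accuracy:", accuracy(test_labels, random_pred))
```

Baselines like these ignore the input text entirely, so they give a floor against which the learned classifiers (SVM, Logistic Regression, the fine-tuned transformers) can be judged. On a skewed label distribution the majority baseline can score deceptively well, which is one reason to compare against it explicitly.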
