Table of Contents
Fetching ...

Detecting Racist Text in Bengali: An Ensemble Deep Learning Framework

S. S. Saruar, Nusrat, Sadia

TL;DR

This work targets Bengali racist text detection on social media by building a dedicated Bengali dataset and applying an ensemble deep-learning framework. It combines Bi-RNN, Bi-LSTM, and a Multi-Channel CNN-LSTM (MCNN-LSTM) with three Bengali BERT-based embeddings (BanglaBERT, BanglaBERT Base, SahajBERT), achieving a peak accuracy of 87.94% on a four-class setup with implicit binary Racism detection. The study demonstrates that MCNN-LSTM often yields the strongest single-model performance and that ensemble averaging provides a modest uplift, particularly with SahajBERT embeddings. The contribution offers a targeted, scalable approach for online moderation in Bengali and highlights the need for larger, more balanced datasets to enable finer-grained multiclass classification.

Abstract

Racism is an alarming phenomenon in our country as well as all over the world. Every day we have come across some racist comments in our daily life and virtual life. Though we can eradicate this racism from virtual life (such as Social Media). In this paper, we have tried to detect those racist comments with NLP and deep learning techniques. We have built a novel dataset in the Bengali Language. Further, we annotated the dataset and conducted data label validation. After extensive utilization of deep learning methodologies, we have successfully achieved text detection with an impressive accuracy rate of 87.94\% using the Ensemble approach. We have applied RNN and LSTM models using BERT Embeddings. However, the MCNN-LSTM model performed highest among all those models. Lastly, the Ensemble approach has been followed to combine all the model results to increase overall performance.

Detecting Racist Text in Bengali: An Ensemble Deep Learning Framework

TL;DR

This work targets Bengali racist text detection on social media by building a dedicated Bengali dataset and applying an ensemble deep-learning framework. It combines Bi-RNN, Bi-LSTM, and a Multi-Channel CNN-LSTM (MCNN-LSTM) with three Bengali BERT-based embeddings (BanglaBERT, BanglaBERT Base, SahajBERT), achieving a peak accuracy of 87.94% on a four-class setup with implicit binary Racism detection. The study demonstrates that MCNN-LSTM often yields the strongest single-model performance and that ensemble averaging provides a modest uplift, particularly with SahajBERT embeddings. The contribution offers a targeted, scalable approach for online moderation in Bengali and highlights the need for larger, more balanced datasets to enable finer-grained multiclass classification.

Abstract

Racism is an alarming phenomenon in our country as well as all over the world. Every day we have come across some racist comments in our daily life and virtual life. Though we can eradicate this racism from virtual life (such as Social Media). In this paper, we have tried to detect those racist comments with NLP and deep learning techniques. We have built a novel dataset in the Bengali Language. Further, we annotated the dataset and conducted data label validation. After extensive utilization of deep learning methodologies, we have successfully achieved text detection with an impressive accuracy rate of 87.94\% using the Ensemble approach. We have applied RNN and LSTM models using BERT Embeddings. However, the MCNN-LSTM model performed highest among all those models. Lastly, the Ensemble approach has been followed to combine all the model results to increase overall performance.
Paper Structure (20 sections, 6 figures, 8 tables)

This paper contains 20 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Data Survey Report
  • Figure 2: Dataset construction process
  • Figure 3: A schematic diagram of our work.
  • Figure 4: Architecture of our proposed MCNN-LSTM.
  • Figure 5: Confusion matrix of Bi-RNN, Bi-LSTM, MCNN-LSTM, and Ensemble Models with the embeddings from Sahaj BERT.(left-right)
  • ...and 1 more figures