Ensemble of pre-trained language models and data augmentation for hate speech detection from Arabic tweets

Kheir Eddine Daouadi; Yaakoub Boualleg; Kheir Eddine Haouaouchi

TL;DR

The paper tackles Arabic hate speech detection on Twitter, addressing data imbalance and limited performance by deploying an ensemble of pre-trained Arabic language models. It combines AraBERT-Large, AraBERT-Base, and MARBERT with semi-supervised data augmentation to enhance five-class classification. Tenfold cross-validation shows that the ensemble methods outperform existing baselines, and that the proposed augmentation further improves macro- and micro-F1 scores. The approach offers a scalable, label-efficient path to robust Arabic hate speech detection, with practical impact for social media moderation and analysis.

Abstract

Hate speech classification of Arabic tweets has drawn the attention of many researchers, and many systems and techniques have been developed for this task. Nevertheless, two major challenges remain: limited performance and imbalanced data. In this study, we propose a novel approach that leverages ensemble learning and semi-supervised learning based on previously manually labeled data. We conducted experiments on a benchmark dataset, classifying Arabic tweets into five distinct classes: non-hate, general hate, racial, religious, and sexism. Experimental results show that (1) ensemble learning based on pre-trained language models outperforms existing related works, and (2) our proposed data augmentation improves the accuracy of hate speech detection from Arabic tweets and outperforms existing related works. Our main contribution is the achievement of encouraging results in Arabic hate speech detection.
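The abstract does not say how the individual models' predictions are combined or how unlabeled tweets are selected for augmentation. As an illustration only, the sketch below assumes soft voting (averaging per-class probabilities) and a confidence threshold for pseudo-labeling. The function names, the class order, and the 0.9 threshold are hypothetical choices for this example, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Class names come from the abstract; their order here is an arbitrary assumption.
CLASSES = ["non-hate", "general hate", "racial", "religious", "sexism"]


def soft_vote(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities produced by several models."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predict(model_probs: Sequence[Sequence[float]]) -> Tuple[str, float]:
    """Return the ensemble's top class and its averaged probability."""
    avg = soft_vote(model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return CLASSES[best], avg[best]


def pseudo_label(unlabeled, threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Keep only unlabeled items whose ensemble confidence reaches the threshold.

    `unlabeled` is a list of (text, model_probs) pairs; the threshold value
    is an assumed hyperparameter for this sketch.
    """
    kept = []
    for text, model_probs in unlabeled:
        label, confidence = predict(model_probs)
        if confidence >= threshold:
            kept.append((text, label))
    return kept


# Illustrative probabilities from three hypothetical models for two tweets.
uncertain = [
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.05, 0.05],
    [0.05, 0.80, 0.05, 0.05, 0.05],
]
confident = [
    [0.95, 0.02, 0.01, 0.01, 0.01],
    [0.90, 0.04, 0.02, 0.02, 0.02],
    [0.97, 0.01, 0.01, 0.005, 0.005],
]
augmented = pseudo_label([("tweet A", uncertain), ("tweet B", confident)])
```

In this sketch only the second tweet would be added to the training set, because the first one's averaged top probability falls below the assumed threshold. Raising the threshold trades fewer pseudo-labeled examples for less label noise.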

Paper Structure (19 sections, 2 figures, 11 tables)

Introduction
Related Works
Traditional Approaches
Deep Learning Approaches
Deep Learning from Scratch Approaches
Transfer Learning Approaches
Data Augmentation Methods
Proposed Approach
Data Augmentation
Tweets Preprocessing
Transfer Learning
Ensemble Learning
Experimental results and evaluation
Fine-tuning
The Effect of Ensemble Learning
...and 4 more sections

Figures (2)

Figure 1: The process of transfer learning Corpus (NH: Normal, G: General hate speech, Re: Religious, S: Sexism, Ra: Racism).
Figure 2: Single sentence classification using BERT.
