Hate Speech Detection and Classification in Amharic Text with Deep Learning
Samuel Minale Gashe, Seid Muhie Yimam, Yaregal Assabie
TL;DR
The paper addresses hate speech detection in Amharic, a low-resource language, by creating a dataset of 5k annotated posts and training a Stacked Bidirectional LSTM (SBi-LSTM) to classify them as racial, religious, gender, or non-hate speech. The model is benchmarked against rule-based and classical ML baselines and reaches a peak F1-score of 94.8, indicating that deep learning is effective for Amharic text. The dataset was annotated by 100 native speakers using a custom annotation tool, and preprocessing includes SMOTE balancing and multiple text representations (fastText and TF-IDF). The work advances practical NLP for Ethiopian social-media moderation and provides publicly available resources to enable replication and future improvements.
Abstract
Hate speech is a growing problem on social media. It can seriously harm society, especially in countries like Ethiopia, where it can trigger conflicts among diverse ethnic and religious groups. While hate speech detection for resource-rich languages is progressing, work on low-resource languages such as Amharic is lacking. To address this gap, we develop an Amharic hate speech dataset and an SBi-LSTM deep learning model that classifies text into four categories: racial, religious, gender, and non-hate speech. We annotated 5k Amharic social media posts and comments with these four categories. The data was annotated by a total of 100 native Amharic speakers using a custom annotation tool. The model achieves an F1-score of 94.8. Future work will include expanding the dataset and developing state-of-the-art models.
Keywords: Amharic hate speech detection, classification, Amharic dataset, Deep Learning, SBi-LSTM
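To make the reported metric concrete, the following is a minimal sketch of how an F1-score can be computed for a four-class task like this one. It assumes macro-averaging (the unweighted mean of per-class F1), which the abstract does not specify; the label strings, the function names per_class_f1 and macro_f1, and the toy predictions are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter

# Label names mirror the four categories named in the abstract (assumed spelling).
LABELS = ["racial", "religious", "gender", "non-hate"]


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return {label: F1}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy example with made-up labels, only to show the call pattern.
toy_true = ["racial", "racial", "religious", "gender", "non-hate", "non-hate"]
toy_pred = ["racial", "religious", "religious", "gender", "non-hate", "gender"]
print(per_class_f1(toy_true, toy_pred))
print(round(macro_f1(toy_true, toy_pred), 3))
```

Macro-averaging weights all four categories equally, which matters when some hate speech categories are much rarer than others. A weighted or micro average would give a different number on imbalanced data, so the averaging scheme should be confirmed before comparing against the reported 94.8.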
