A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages

Susmita Das, Arpita Dutta, Kingshuk Roy, Abir Mondal, Arnab Mukhopadhyay

TL;DR

This survey addresses automatic hate speech detection in low-resource languages by compiling and analyzing datasets, methods, and challenges across English and non-English contexts. It traces the evolution from lexical features to embeddings and transformer-based models, highlights the prevalence of English-centric resources, and catalogs multilingual and multimodal datasets that enable cross-language and cross-domain research. Key contributions include a comprehensive dataset inventory (English, monolingual and multilingual low-resource datasets, and multimodal corpora), a taxonomy of hate speech categories, and a synthesis of methodological trends and research gaps. The work underscores the need for culturally and linguistically informed models, better data collection practices, and ethical, transparent moderation tools to improve hate speech detection in diverse online communities.

Abstract

The expanding influence of social media platforms over the past decade has changed the way people communicate. The anonymity afforded by social media and the easy accessibility of the internet have facilitated the spread of hate speech. The terms and expressions associated with hate speech change over time, which poses an obstacle to policy-makers and researchers trying to identify it. As a growing number of individuals communicate in their native languages, hate speech in these low-resource languages is also growing. Although English-language approaches are well known, these low-resource languages have received little attention because of the lack of datasets and of data available online. This article provides a detailed survey of hate speech detection in low-resource languages around the world, covering the available datasets, the features utilized, and the techniques used. It further discusses existing surveys, concepts that overlap with hate speech, research challenges, and opportunities.

Paper Structure

This paper contains 46 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Survey Overview
  • Figure 2: Monthly Active Users on Different Social Media Platforms
  • Figure 3: Relation between Hate Speech and Extended Concepts
  • Figure 4: Content Percentage of Languages on Various Social Media Platforms
  • Figure 5: Collection of Relevant Documents
  • ...and 2 more figures