Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges
Aiqi Jiang, Arkaitz Zubiaga
TL;DR
This paper addresses the challenge of detecting offensive language across languages by systematically surveying cross-lingual transfer learning (CLTL) approaches. It introduces a fine-grained taxonomy of CLTL approaches (instance, feature, and parameter transfer) and analyzes 67 papers and 82 multilingual hate speech datasets to understand how knowledge transfers across languages. The review highlights the rise of multilingual pre-trained language models, translation-based and feature-based strategies, and the growing attention to low-resource languages, and it identifies core challenges such as data bias, cultural variation, and model interpretability. It also outlines future directions, including dataset creation, culturally aware annotations, integration of auxiliary features, and the exploration of large language models, to build more robust and ethically sound cross-lingual detection systems.
Abstract
The growing prevalence and rapid evolution of offensive language in social media make detection more complex, particularly when such content must be identified across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study is the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of the multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. Based on "what to transfer", we also summarise three main CLTL approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Finally, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.
