Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

Aiqi Jiang, Arkaitz Zubiaga

TL;DR

This paper addresses the challenge of detecting offensive language across languages by systematically surveying cross-lingual transfer learning (CLTL) approaches. It introduces a fine-grained taxonomy of CLTL transfers—instance, feature, and parameter—and analyzes 67 papers and 82 multilingual hate speech datasets to understand how knowledge transfers across languages. The review highlights the rise of multilingual pre-trained language models, translation-based and feature-based strategies, and the growing attention to low-resource languages, while also delineating core challenges such as data bias, cultural variation, and model interpretability. It further outlines future directions, including dataset creation, culturally aware annotations, integration of auxiliary features, and the exploration of large language models, to build more robust and ethically sound cross-lingual detection systems.

Abstract

The growing prevalence and rapid evolution of offensive language in social media amplify the complexities of detection, particularly highlighting the challenges in identifying such content across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. According to "what to transfer", we also summarise three main CLTL transfer approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Furthermore, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.

Paper Structure (50 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: PRISMA flowchart showing the phases of the selection of research articles in this review.
  • Figure 2: Publications per year up to July 2023.
  • Figure 3: Distribution of languages and language families covered in the datasets.
  • Figure 4: Hierarchy of cross-lingual transfer approaches.
  • Figure 5: Different scenarios in parameter transfer for automated detection of cross-lingual HS.
  • ...and 2 more figures