ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-wei Lee

TL;DR

This work tackles the brittleness of Chinese offensive language detection under cloaking perturbations by introducing ToxiCloakCN, which applies homophone substitutions and emoji transformations to the ToxiCN baseline. It systematically evaluates GPT-4o and several open LLMs under six prompting templates, with an additional probe via pinyin augmentation. The results show substantial performance degradation on cloaked data, with pinyin augmentation offering limited or negative gains and prompting language significantly affecting results; humans consistently surpass models in interpreting cloaked content. The study underscores the urgency of developing more robust, semantically aware detection methods that can better handle evolving evasion tactics in Chinese online discourse.

Abstract

Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.
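
To make the two cloaking perturbations concrete, the following is a minimal sketch of character-level homophone and emoji substitution. The substitution tables, the blocklist, and the function names `cloak` and `naive_flag` are illustrative assumptions, not the lexicon or pipeline used to build ToxiCloakCN, and the keyword filter is only a stand-in for a detector (the models evaluated in the paper are LLMs).

```python
# Toy substitution tables (hypothetical, for illustration only).
# Near-homophones: same syllable, tone may differ (傻 sha3 -> 沙 sha1, 死 si3 -> 四 si4).
HOMOPHONES = {"傻": "沙", "死": "四"}
# Emoji stand-ins for characters with an obvious pictographic equivalent.
EMOJI = {"猪": "🐷", "狗": "🐶"}

# Hypothetical keyword blocklist used by the naive stand-in detector.
BLOCKLIST = {"猪", "傻"}


def cloak(text: str, table: dict) -> str:
    """Replace every character found in `table`; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def naive_flag(text: str) -> bool:
    """Flag text that contains any blocklisted character."""
    return any(ch in BLOCKLIST for ch in text)


if __name__ == "__main__":
    original = "你是猪"  # "you are a pig"
    cloaked = cloak(original, EMOJI)
    print(original, naive_flag(original))
    print(cloaked, naive_flag(cloaked))
```

In this toy setting the cloaked sentence remains readable to a human while no longer matching the blocklist, which is the kind of evasion the dataset is designed to probe.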

Paper Structure (22 sections, 2 figures, 5 tables)

This paper contains 22 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Example of cloaked Chinese offensive language using homophone and emoji replacement. With such techniques, users can fool an automated offensive language detector into classifying offensive sentences as normal.
  • Figure 2: Comparison of the models' error rates on sentences in the base dataset and on the homophone- or emoji-replaced sentences, using prompt type Chinese_text, broken down by offensive content type. Smaller error rates indicate better performance.