All You Need is "Leet": Evading Hate-speech Detection AI

Sampanna Yashwant Kahu, Naman Ahuja

TL;DR

This work examines how vulnerable online hate-speech detectors are to adversarial text perturbations in a black-box setting. It introduces seven perturbations across three attack classes and evaluates them on Perspective API and HateSonar using the Mondal et al. dataset, reporting that up to 86.8% of hateful text evades detection. The methodology combines a word-level toxicity assessment, used to select perturbation targets, with a systematic perturbation pipeline, and its metrics capture both attack effectiveness and perturbation magnitude. The findings identify tokenization and input handling as key weaknesses and propose defenses based on normalization, improved tokenization, and preprocessing. The study advances practical understanding of detector robustness in machine-learning-as-a-service (MLaaS) contexts and informs the design of more resilient hate-speech detection systems.

Abstract

Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are also used to spread hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate-speech detection models, thereby decreasing their efficiency. We also ensure that the perturbations change the original meaning of the hate speech only minimally. Our best perturbation attack evades hate-speech detection for 86.8% of hateful text.
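
To make the pipeline summarized in the TL;DR concrete, here is a minimal Python sketch of one possible leet-style perturbation: score each word, rewrite only the highest-scoring one, and measure the change with Levenshtein edit distance. Everything named here is an assumption for illustration. `LEET_MAP`, `toy_score`, and the helper functions are hypothetical and are not taken from the paper, and `toy_score` merely stands in for a black-box query to a detector such as Perspective API or HateSonar. The `normalize` helper illustrates the normalization defense in its simplest form.

```python
from typing import Callable, Dict, List

# Hypothetical substitution table; the paper's actual mapping is not given here.
LEET_MAP: Dict[str, str] = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
REVERSE_MAP: Dict[str, str] = {digit: letter for letter, digit in LEET_MAP.items()}


def to_leet(word: str) -> str:
    """Replace every mapped letter with its look-alike digit."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in word)


def perturb(text: str, score: Callable[[str], float]) -> str:
    """Rewrite only the word that the scoring function rates as most toxic."""
    words: List[str] = text.split()
    if not words:
        return text
    target = max(range(len(words)), key=lambda i: score(words[i]))
    words[target] = to_leet(words[target])
    return " ".join(words)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the perturbation-magnitude metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Naive defense: map look-alike digits back to letters before scoring."""
    return "".join(REVERSE_MAP.get(ch, ch) for ch in text)


def toy_score(word: str) -> float:
    """Stand-in for a black-box detector call (assumption, not a real API)."""
    return 1.0 if word.lower() in {"idiots"} else 0.0


original = "those people are idiots"
attacked = perturb(original, toy_score)
print(attacked)
print(edit_distance(original, attacked))
print(normalize(attacked))
```

The naive `normalize` also rewrites genuine digits (a year such as "2015" would be corrupted), so a practical defense would need context-aware normalization rather than a blanket reverse mapping.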

Paper Structure

This paper contains 27 sections, 21 figures, and 2 tables.

Figures (21)

  • Figure 1: Category distribution of dataset according to Perspective API
  • Figure 2: How the toxicity of the dataset varies with the toxicity threshold for Perspective API
  • Figure 3: Category distribution of dataset according to HateSonar
  • Figure 4: Edit distance evaluations for perturbations on Perspective API and HateSonar
  • Figure 5: Process diagram for our approach
  • ...and 16 more figures