All You Need is "Leet": Evading Hate-speech Detection AI
Sampanna Yashwant Kahu, Naman Ahuja
TL;DR
This work targets the vulnerability of online hate-speech detectors to adversarial text perturbations under a black-box setting. It introduces seven perturbations across three attack classes and evaluates them on Perspective API and HateSonar using the Mondal et al. dataset, reporting up to 86.8% evasion of hate content. The methodology combines a word-level toxicity assessment to select perturbation targets and a systematic perturbation pipeline, with metrics capturing both effectiveness and perturbation magnitude. The findings highlight tokenization and input handling as key weaknesses and propose defenses based on normalization, improved tokenization, and preprocessing to mitigate such attacks. The study advances practical understanding of detector robustness in MLaaS contexts and informs the design of more resilient hate-speech detection systems.
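The normalization defense mentioned above can be illustrated with a minimal sketch. It assumes a small, hypothetical leet-to-letter table (`LEET_TO_LETTER`) and a hypothetical helper `normalize_leet`; neither comes from the paper, and the character substitutions the authors actually evaluate are not listed in this summary. The idea is only to show how input could be canonicalized before it reaches a detector.

```python
# Hypothetical leet-to-letter table (an assumption for illustration only;
# the paper's actual substitution set is not given in this summary).
LEET_TO_LETTER = str.maketrans({
    "4": "a", "@": "a",
    "3": "e",
    "1": "i",
    "0": "o",
    "5": "s", "$": "s",
    "7": "t",
})


def normalize_leet(text: str) -> str:
    """Lowercase the text and undo leet substitutions in tokens that contain letters."""
    tokens = []
    for token in text.lower().split():
        # Purely numeric or symbolic tokens (e.g. "911") are left untouched.
        if any(ch.isalpha() for ch in token):
            token = token.translate(LEET_TO_LETTER)
        tokens.append(token)
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_leet("H3ll0 w0rld, call 911"))
```

A preprocessing step of this kind would run before the text is sent to the detector. It is deliberately crude: a token such as "mp3" would be rewritten as "mpe", so a practical defense would likely need a dictionary check or a learned normalizer on top of it.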
Abstract
Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are also used to spread hate speech. In this paper, with the aim of protecting users from hate speech on online platforms, we design black-box techniques that generate perturbations capable of fooling state-of-the-art deep-learning-based hate-speech detection models, thereby reducing their effectiveness. The perturbations are designed to change the original meaning of the hateful text as little as possible. Our best perturbation attack evades hate-speech detection for 86.8% of hateful text.
