NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models

Yiran Ye; Thai Le; Dongwon Lee

TL;DR

The paper tackles the gap between machine-generated perturbations and real human-written perturbations in toxic-text detection by introducing NoisyHate, a high-quality dataset of human-written perturbations paired with clean text. It builds NoisyHate through a three-step, human-in-the-loop pipeline and validates the dataset via crowdsourcing, yielding 1,339 high-quality perturbed examples. Through experiments with BERT, RoBERTa, and the Perspective API, it demonstrates that human perturbations pose distinct challenges and that normalization can help only under certain perturbation types, while Perspective API offers the best overall robustness to perturbations. The dataset provides a practical benchmark for evaluating and improving toxicity detection models and motivates future work on normalization tools and adversarial training to enhance real-world resilience.

Abstract

Online texts with toxic content are a clear threat to users on social media in particular and to society in general. Although many platforms have adopted various measures (e.g., machine learning-based hate-speech detection systems) to diminish their effect, toxic content writers have attempted to evade such measures by using cleverly modified toxic words, so-called human-written text perturbations. To help build automatic detection tools that recognize those perturbations, prior methods have developed sophisticated techniques to generate diverse adversarial samples. However, we note that these "algorithm"-generated perturbations do not necessarily capture all the traits of "human"-written perturbations. Therefore, in this paper, we introduce a novel, high-quality dataset of human-written perturbations, named NoisyHate, that was created from real-life perturbations that are both written and verified by humans in the loop. We show that perturbations in NoisyHate have different characteristics from those in prior algorithm-generated toxic datasets, and thus can be particularly useful for developing better toxic speech detection solutions. We thoroughly validate NoisyHate against state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, on two tasks: perturbation normalization and perturbation understanding.

Paper Structure (16 sections, 5 figures, 5 tables, 1 algorithm)

This paper contains 16 sections, 5 figures, 5 tables, 1 algorithm.

Introduction
Related Work
Text Perturbation Generation
NoisyHate Dataset
Overview and Usage
Step 1: Data Source and Cleaning
Step 2: Sentence Perturbation
Step 3: Quality Assurance via Crowd-sourcing
Potential Uses
Perturbations Normalization
Perturbations Understanding
Limitations
Conclusions and Future Work
Acknowledgements
Ethical Checklist
...and 1 more sections

Figures (5)

Figure 1: Overall curation pipeline of the NoisyHate dataset. This pipeline has three steps: (1) data sourcing and cleaning from the original Jigsaw dataset (see "Step 1: Data Source and Cleaning"), (2) sentence perturbation with human-written perturbations via pseudo-random sampling (see "Step 2: Sentence Perturbation"), and (3) human evaluation via crowdsourcing to validate the quality of the perturbed sentences (see "Step 3: Quality Assurance via Crowd-sourcing").
Figure 2: Python code for loading the NoisyHate datasets into a table using the Hugging Face API
Figure 3: Comparison of the distribution of different perturbation categories before and after validation by human workers (see "Step 3: Quality Assurance via Crowd-sourcing"). "Raw" refers to the original data we send to MTurk workers, and "preserved" refers to the number/percentage saved after human evaluation.
Figure 4: Model accuracy vs. threshold curves: in each chart, the x-axis represents the threshold, and the y-axis is the models' accuracy. In the top row charts, the all_perturbation curve represents the weighted mean accuracy, computed by combining the accuracy of each perturbation type weighted by its frequency.
Figure 5: Human evaluation Interface: a clean-perturbed word pair will be highlighted when the worker moves the mouse cursor over one of them. By clicking the highlighted word, the worker commits that this is the identified clean-perturbed pair.
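
The Figure 4 caption describes two computations: accuracy as a function of a decision threshold, and an all_perturbation curve obtained by weighting each perturbation type's accuracy by its frequency. The following is a minimal sketch of both, not the authors' evaluation code. The function names, the `>=` decision rule, and the perturbation-type labels (`type_a`, `type_b`) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def accuracy_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Accuracy when a toxicity score >= threshold is read as 'toxic'.

    The >= rule is an assumption; the caption does not state the rule used.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def weighted_mean_accuracy(
    accuracy_by_type: Dict[str, float], count_by_type: Dict[str, int]
) -> float:
    """Combine per-type accuracies, weighting each type by its example count."""
    total = sum(count_by_type[t] for t in accuracy_by_type)
    if total == 0:
        raise ValueError("at least one perturbation type must have examples")
    weighted_sum = sum(accuracy_by_type[t] * count_by_type[t] for t in accuracy_by_type)
    return weighted_sum / total


# Hypothetical scores and labels for two illustrative perturbation types.
scores = {"type_a": [0.9, 0.4, 0.7], "type_b": [0.2]}
labels = {"type_a": [1, 1, 1], "type_b": [0]}

threshold = 0.5
accuracy_by_type = {
    t: accuracy_at_threshold(scores[t], labels[t], threshold) for t in scores
}
count_by_type = {t: len(scores[t]) for t in scores}
overall = weighted_mean_accuracy(accuracy_by_type, count_by_type)
```

Sweeping `threshold` over a grid and recomputing `overall` at each value would trace one curve of the kind the caption describes. Weighting by count makes the combined value equal to the accuracy over the pooled examples, so rare perturbation types do not dominate the curve.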
