TaeBench: Improving Quality of Toxic Adversarial Examples

Xuan Zhu; Dmitriy Bespalov; Liwen You; Ninad Kulkarni; Yanjun Qi

TL;DR

TaeBench addresses the vulnerability of toxicity detectors to toxic adversarial examples (TAEs) by introducing a model- and human-driven annotation pipeline that quality-controls TAEs generated from more than 20 attack recipes. The workflow filters a raw pool of ≈940k TAEs down to a high-quality TaeBench dataset of ≈264k samples, which supports transfer-attack benchmarking and adversarial training. Empirically, TaeBench samples transfer as attacks against SOTA detectors, and adversarial training with TaeBench significantly improves detector robustness; TaeBench+ further strengthens the defense by adding adversarial examples derived from benign seeds. The result is a reusable benchmark and training resource for stress-testing and strengthening toxicity content moderation in real-world systems.

Abstract

Toxicity text detectors can be vulnerable to adversarial examples - small perturbations to input text that fool the systems into wrong detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. A successful TAE should fool a target toxicity model into making a benign prediction, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples among a total of 940k raw TAE attack generations. We then use the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench achieves significant improvements in the robustness of two toxicity detectors.
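The four quality requirements listed in the abstract can be read as a conjunction of filters over candidate TAEs. The following is a minimal sketch of that idea, not the authors' pipeline: every name in it (`TaeCandidate`, `filter_tae`, the toy blocklist detector, and the three heuristic checks) is a hypothetical stand-in for the proxy models, LLM judge, and human annotation the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TaeCandidate:
    """A raw attack output paired with the seed it was derived from."""
    seed_text: str
    adversarial_text: str


# A check returns True when the candidate satisfies one quality requirement.
Check = Callable[[TaeCandidate], bool]

# Hypothetical toy vocabulary; a real target would be a trained classifier.
BLOCKLIST = {"idiot"}


def toy_detector_flags(text: str) -> bool:
    """Stand-in target model: flags text containing a blocklisted token."""
    return any(tok.strip(".,!?") in BLOCKLIST for tok in text.lower().split())


def fools_target(c: TaeCandidate) -> bool:
    """Requirement 1: the target model predicts 'benign' on the TAE."""
    return not toy_detector_flags(c.adversarial_text)


def is_grammatical(c: TaeCandidate) -> bool:
    """Requirement 2 (crude proxy): token count matches the seed."""
    return len(c.adversarial_text.split()) == len(c.seed_text.split())


def looks_natural(c: TaeCandidate) -> bool:
    """Requirement 3 (crude proxy): at least 80% of characters are letters."""
    chars = [ch for ch in c.adversarial_text if not ch.isspace()]
    return bool(chars) and sum(ch.isalpha() for ch in chars) / len(chars) >= 0.8


def still_toxic(c: TaeCandidate) -> bool:
    """Requirement 4 (crude proxy): toxicity survives undoing digit swaps."""
    normalized = c.adversarial_text.lower().translate(str.maketrans("10", "io"))
    return toy_detector_flags(normalized)


def filter_tae(candidates: Iterable[TaeCandidate],
               checks: Iterable[Check]) -> List[TaeCandidate]:
    """Keep only candidates that pass every check."""
    checks = list(checks)
    return [c for c in candidates if all(check(c) for check in checks)]


def retention(kept: int, raw: int) -> float:
    """Fraction of the raw pool that survives filtering."""
    return kept / raw


SEED = "you are an idiot"
raw_pool = [
    TaeCandidate(SEED, "you are an id1ot"),   # passes all four checks
    TaeCandidate(SEED, "you are an idiot"),   # does not fool the detector
    TaeCandidate(SEED, "you are an expert"),  # toxicity was lost
    TaeCandidate(SEED, "y0u @r3 @n 1d10t"),   # too unnatural
]
kept = filter_tae(raw_pool, [fools_target, is_grammatical, looks_natural, still_toxic])
```

Applied to the rounded sizes reported in the abstract, `retention(264_000, 940_000)` is roughly 0.28, so under this reading a little over a quarter of the raw generations survive quality control.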

Paper Structure (29 sections, 2 equations, 1 figure, 14 tables)

Introduction
Toxic Adversarial Examples (TAE) and Attack Recipes
Running $>20$ SOTA Recipes for a Large Unfiltered TAE Pool
Improving TAE Quality with an Annotation Pipeline
LLM Judge and Small Models based Automated Quality Controls
Human Evaluation to Annotate TAE on Toxicity and Naturalness
TaeBench and TaeBench+
TAE Generation with Proxy Models and Seeding Datasets
Jigsaw:
Offensive Tweet:
Local Proxy Text Toxicity Models as Targets:
TaeBench: a Large Set of Quality Controlled TAE Samples
TaeBench+: Benign Seeds Derived Adversarial Examples
Example Use Cases of TaeBench and TaeBench+
Benefit I: Benchmark Toxicity Detectors via Transfer Attacks
...and 14 more sections

Figures (1)

Figure 1: Overall workflow of building TaeBench and two potential use cases of TaeBench. We generate raw TAE by adapting more than 20 SOTA adversarial example generation recipes (Table \ref{table:categorized-attacks-short}). We then curate the raw TAE with a workflow of filtering strategies to improve their quality, and we name the resulting dataset TaeBench. Users can also inject custom TAE samples generated from new seeds and/or attack algorithms into our TAE quality control pipeline, and use the filtered TAE outputs in downstream applications (such as benchmarking and training).
