Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

TL;DR

This work tackles NLP robustness to real human adversaries by learning to imitate their attack strategies. It introduces two synthetic-attack generators, Direct Imitation (DI) and ICE, which learn from a fixed set of human attacks and generate new, plausible attacks to augment training without increasing model size. Evaluations on ANLI and Dynabench Hate Speech show that training with these synthetic attacks improves robustness to future attack rounds beyond what is achieved by past attacks alone, with notable gains in accuracy and AUC. Importantly, the study finds that traditional proxies like MAUVE similarity, label noise, or attack success rate do not reliably predict robustness, underscoring the value of distribution-aware attack synthesis for real-world NLP security.

Abstract

Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word-substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.
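The before/after figures quoted in the abstract can be turned into absolute gains with a few lines of arithmetic. The sketch below only restates the numbers given above; the dictionary labels and the helper name `absolute_gains` are illustrative assumptions, not names used by the paper.

```python
# Before/after figures copied from the abstract.
# The keys are illustrative labels, not identifiers from the paper.
ANLI_ACCURACY = {
    "current round": (44.1, 50.1),
    "future round 1": (32.5, 43.4),
    "future round 2": (29.4, 40.2),
}
HATE_SPEECH_AUC = {
    "current round": (0.76, 0.84),
    "future round": (0.77, 0.79),
}


def absolute_gains(results, ndigits):
    """Return (after - before) for each entry, rounded to ndigits decimals."""
    return {
        name: round(after - before, ndigits)
        for name, (before, after) in results.items()
    }


# Accuracy gains are in percentage points; AUC gains are in AUC units.
anli_gains = absolute_gains(ANLI_ACCURACY, 1)
auc_gains = absolute_gains(HATE_SPEECH_AUC, 2)

print("ANLI accuracy gains (points):", anli_gains)
print("Hate speech AUC gains:", auc_gains)
```

Read this way, the reported ANLI improvement is larger on the two unseen future rounds than on the current round, while the hate speech AUC improvement is larger on the current round than on the future one.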

Paper Structure (33 sections, 5 equations, 2 figures, 22 tables, 2 algorithms)

This paper contains 33 sections, 5 equations, 2 figures, 22 tables, 2 algorithms.

Figures (2)

  • Figure 1: In traditional human-in-the-loop adversarial training, humans attack a model, and the model then learns from those attacks to become more robust (Fig. \ref{fig:flow:old}). Augmenting the human-generated attacks with synthetic attacks is a popular way to increase robustness \cite{min2020syntactic} (Fig. \ref{fig:flow:new}). We propose two new methods that generate synthetic adversarial attacks by learning the patterns of real crowd-sourced attacks. Our methods significantly outperform existing techniques in defending against future, yet-unseen crowd-sourced attacks. Prior work on synthetic attacks, by contrast, does not typically learn patterns from real crowd-sourced attacks; it focuses on making small edits that make the attack harder while keeping label noise low \cite{feng2021survey}.
  • Figure 2: Distributional similarity, as measured by MAUVE \cite{mauve-pillutla-2021} on RoBERTa embeddings from a random sample of 1k examples. MAUVE scores range from 0 to 1, with higher values indicating more similar distributions. MAUVE scores are intended to be compared relative to each other, not read as absolute measures. Note that distributional similarity to the held-out attacks, R2, does not correlate with whether an attack generation method is useful as per Table \ref{tab:anli_r1}.