A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers

Tom Roth; Inigo Jauregi Unanue; Alsharif Abuadbba; Massimo Piccardi

TL;DR

This paper tackles adversarial attacks on text classifiers by recasting adversarial example generation as a generative task: fine-tuning a pre-trained paraphrase model with reinforcement learning. A constraint-enforcing reward, a per-example baseline, and a KL divergence penalty guide the model to produce adversarial paraphrases that flip the victim's prediction while respecting semantic, grammatical, and length constraints. Empirical results on Rotten Tomatoes and Financial PhraseBank show higher attack success rates and more diverse successful examples than a baseline paraphrase model and several token-modification attacks, with beam search decoding performing best. Human validation and cross-task experiments (e.g., TREC) suggest the approach generalizes beyond sentiment analysis, offering a fast, flexible, and scalable way to evaluate and stress-test text classifiers against adversarial paraphrase threats.

Abstract

Text classifiers are vulnerable to adversarial examples: correctly classified examples that are deliberately transformed to be misclassified while satisfying acceptability constraints. The conventional approach to finding adversarial examples is to define and solve a combinatorial optimisation problem over a space of allowable transformations. While effective, this approach is slow and limited by the choice of transformations. An alternative approach is to generate adversarial examples directly by fine-tuning a pre-trained language model, as is commonly done for other text-to-text tasks. This approach promises to be much quicker and more expressive, but is relatively unexplored. For this reason, in this work we train an encoder-decoder paraphrase model to generate a diverse range of adversarial examples. For training, we adopt a reinforcement learning algorithm and propose a constraint-enforcing reward that promotes the generation of valid adversarial examples. Experimental results over two text classification datasets show that our model achieves a higher success rate than the original paraphrase model and is, overall, more effective than other competitive attacks. Finally, we show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
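The abstract and TL;DR name three training ingredients: a constraint-enforcing reward, a reward baseline, and a KL divergence penalty against the pre-trained paraphrase model. The following is a minimal sketch of how such pieces could fit together, not the paper's actual formulation. The function names, the choice of "drop in the victim's confidence on the true label" as the reward signal, the zero reward for constraint violations, and the `kl_coef` value are all assumptions made for illustration.

```python
from typing import Sequence


def constraint_enforcing_reward(
    p_true_orig: float,
    p_true_para: float,
    constraints_ok: Sequence[bool],
) -> float:
    """Hypothetical reward: the drop in the victim's confidence on the true
    label, granted only when every acceptability constraint is satisfied."""
    if not all(constraints_ok):
        return 0.0  # assumed: any violated constraint removes the reward
    return max(0.0, p_true_orig - p_true_para)


def reinforce_loss(
    logp_finetuned: Sequence[float],
    logp_pretrained: Sequence[float],
    reward: float,
    baseline: float,
    kl_coef: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Scalar REINFORCE-with-baseline loss for one sampled paraphrase.

    logp_finetuned: per-token log-probs under the model being trained.
    logp_pretrained: log-probs of the same tokens under the frozen
    pre-trained paraphrase model.
    """
    seq_logp = sum(logp_finetuned)
    # Single-sample estimate of KL(fine-tuned || pre-trained) on this sequence.
    kl_estimate = sum(f - p for f, p in zip(logp_finetuned, logp_pretrained))
    advantage = reward - baseline
    # Minimising this raises the likelihood of above-baseline paraphrases
    # while discouraging drift from the pre-trained model.
    return -advantage * seq_logp + kl_coef * kl_estimate
```

In an actual training loop the log-probabilities would be tensors from an autodiff framework, so that the gradient of this loss flows into the fine-tuned model's parameters; plain floats are used here only to keep the arithmetic visible.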

Paper Structure (26 sections, 8 equations, 6 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Proposed Approach
Overview
Loss function
Paraphrase reward
Reward baseline
Adversarial example constraints
Experimental setup
Datasets
Hyperparameters and design choices
Decoding temperature during training
Decoding method during evaluation
Results
Attack success rate
...and 11 more sections

Figures (6)

Figure 1: Examples of successful adversarial attacks against a sentiment classifier obtained with the proposed approach. At the top, the adversarial examples flip the sentiment from the original neutral (blue) to positive (green); at the bottom, the sentiment goes from the original negative (red) to neutral (blue).
Figure 2: Sample generation during training and validation. (a) During training, we generate one paraphrase per original example, decoding with nucleus sampling. (b) During validation, we generate a set of paraphrases per original example, decoding with one of four methods (Section \ref{sec:decoding}). We then check if any paraphrase in the set is a successful adversarial example, and also use the set (for the training split) to update the reward baseline (Section \ref{sec:reward_baseline}).
Figure 3: A diagram of the training approach. As input, training uses batches of (original, paraphrase) pairs. The parameters are updated using the REINFORCE algorithm with a baseline. The overall loss function depends on the reward function, the baseline, the constraints, and the KL divergence penalty, which compares the probabilities computed by the fine-tuned and pre-trained paraphrase models.
Figure 4: Attack success rate and diversity of decoding methods. For each graph: RT = Rotten Tomatoes, FP = Financial PhraseBank. (a) Attack success rate by evaluation decoding method, across seeds. We see the common RL training problem of high variance across seeds (Henderson et al., 2017). Beam search and low-diversity beam search perform best, on average. (b) Candidate set diversity of each decoding method, which we measure using a cluster-based score (see Section \ref{sec:analysis_decoding_method}). More clusters indicate a more diverse candidate set.
Figure 5: Fluency scores for the various decoding methods. RT = Rotten Tomatoes, FP = Financial PhraseBank. (a) Median perplexity of the generated candidate sets, with examples combined from the top two runs of each decoding method. Three of the methods are approximately comparable, while high-diversity beam search consistently produces the least fluent candidates. (b) Average number of distinct bigrams generated per epoch when evaluating on the training set. High-diversity beam search (in purple) consistently generates more unique bigrams than the other methods. The sampling decoding method shows a marked decrease in diversity over the epochs, while the others remain approximately constant. These results confirm the expected trade-off between fluency and diversity.
...and 1 more figures
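Two of the quantities in the captions above are simple enough to sketch directly: the set-level success check of Figure 2(b), where an original counts as attacked if any paraphrase in its candidate set succeeds, and the distinct-bigram count of Figure 5(b). The sketch below is illustrative only; the function names, the lower-cased whitespace tokenisation, and the `is_successful` callback (which in practice would combine the victim's prediction with the constraint checks) are assumptions, not details taken from the paper.

```python
from typing import Callable, Iterable, Sequence, Set, Tuple


def distinct_bigrams(candidates: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect the unique bigrams across a set of generated candidates.
    Assumes lower-cased whitespace tokenisation for simplicity."""
    bigrams: Set[Tuple[str, str]] = set()
    for text in candidates:
        tokens = text.lower().split()
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams


def attack_success_rate(
    candidate_sets: Sequence[Tuple[str, Sequence[str]]],
    is_successful: Callable[[str, str], bool],
) -> float:
    """Fraction of originals for which at least one candidate paraphrase
    is a successful adversarial example.

    candidate_sets: (original, candidates) pairs.
    is_successful: hypothetical predicate taking (original, candidate).
    """
    if not candidate_sets:
        return 0.0
    hits = sum(
        1
        for original, candidates in candidate_sets
        if any(is_successful(original, c) for c in candidates)
    )
    return hits / len(candidate_sets)
```

Because success is judged per candidate set rather than per candidate, a decoding method that returns more varied candidates gets more chances per original, which is one reason the choice of decoding method matters at evaluation time.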
