Token-Modification Adversarial Attacks for Natural Language Processing: A Survey

Tom Roth, Yansong Gao, Alsharif Abuadbba, Surya Nepal, Wei Liu

TL;DR

This survey analyzes token-modification adversarial attacks in NLP through a component-centric lens, framing each attack as a constrained search defined by a goal function, transformations, a search method, and constraints. It divides adversary goals by task type (classification and seq2seq), details targeted and untargeted variants, and illustrates non-standard objectives such as availability. It provides a taxonomy of transformations (e.g., arbitrary replacement, human errors, visually similar characters, synonyms, LM-driven, phrase manipulations, rules, homographs), search strategies (gradient-based, proxy models, heuristics, beam/greedy, population-based), and constraint types (readability, semantic, distance, performance). The paper advocates standardizing attack scenarios, extending to multimodal contexts, increasing multi-token perturbations, and incorporating human judgments to better reflect real-world impact and evaluation.

Abstract

Many adversarial attacks target natural language processing systems, most of which succeed through modifying the individual tokens of a document. Despite the apparent uniqueness of each of these attacks, fundamentally they are simply a distinct configuration of four components: a goal function, allowable transformations, a search method, and constraints. In this survey, we systematically present the different components used throughout the literature, using an attack-independent framework which allows for easy comparison and categorisation of components. Our work aims to serve as a comprehensive guide for newcomers to the field and to spark targeted research into refining the individual attack components.
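The four-component framing in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical toy written for this summary, not code from the survey or the API of any attack library: the lexicon-based `toy_model`, the `SYNONYMS` table, and all function names are assumptions chosen only to show how a goal function, a transformation, a constraint, and a search method fit together.

```python
from typing import Iterator, List, Optional

Tokens = List[str]

# Hypothetical victim model: a tiny lexicon-based sentiment scorer.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "awful", "poor"}

def toy_model(tokens: Tokens) -> float:
    """Return a positive-sentiment score in [0, 1]."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.5 + 0.5 * (pos - neg) / max(1, pos + neg)

def predict(tokens: Tokens) -> int:
    """Label 1 (positive) if the score exceeds 0.5, else 0."""
    return 1 if toy_model(tokens) > 0.5 else 0

# Component 1: goal function (untargeted: flip the original label).
def goal_score(tokens: Tokens, original_label: int) -> float:
    p_pos = toy_model(tokens)
    return 1.0 - p_pos if original_label == 1 else p_pos

def goal_reached(tokens: Tokens, original_label: int) -> bool:
    return predict(tokens) != original_label

# Component 2: transformation (single-token synonym swap, assumed table).
SYNONYMS = {"great": ["fine", "grand"], "film": ["movie"]}

def transformations(tokens: Tokens) -> Iterator[Tokens]:
    for i, tok in enumerate(tokens):
        for sub in SYNONYMS.get(tok, []):
            yield tokens[:i] + [sub] + tokens[i + 1:]

# Component 3: constraint (distance: limit the number of changed tokens).
def within_budget(original: Tokens, candidate: Tokens, max_changes: int) -> bool:
    return sum(a != b for a, b in zip(original, candidate)) <= max_changes

# Component 4: search method (greedy: take the best candidate each step).
def greedy_attack(tokens: Tokens, max_changes: int = 1) -> Optional[Tokens]:
    original = list(tokens)
    label = predict(original)
    current = original
    while not goal_reached(current, label):
        candidates = [c for c in transformations(current)
                      if within_budget(original, c, max_changes)]
        if not candidates:
            return None  # constraints leave nothing to try
        best = max(candidates, key=lambda c: goal_score(c, label))
        if goal_score(best, label) <= goal_score(current, label):
            return None  # no candidate makes progress
        current = best
    return current
```

Calling `greedy_attack("the film was great".split())` on this toy setup swaps one token and flips the toy model's label. Replacing any single component (for example, beam search in place of the greedy loop, or a semantic-similarity check in place of the change budget) yields a different attack without touching the other three, which is the comparison the survey's framework is meant to support.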

Paper Structure

This paper contains 42 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Example of a token-modification adversarial attack on a machine translation model.
  • Figure 2: This example, from the TextAttack documentation, demonstrates how token-modification attacks (here, the genetic-algorithm attack of Alzantot et al., 2018, and TextFooler, Jin et al., 2020) are each a combination of four components. For easy comparison, the TextAttack paper also provides a table summarising the component combinations of a number of popular attacks. We cover the individual components in detail in the sections on goal functions, transformations, search methods, and constraints.
  • Figure 3: Continuous perturbations in a numerical space work for image adversarial attacks, where the human eye cannot pick up subtle colour changes, but in the text domain such perturbations are very conspicuous. In other words, while ${\bm{x}}^*$ fools the model, its text form $Ex'$ may not be semantically valid. The red arrow shows the problematic step.
  • Figure 4: A saliency map visualising token importance for a sentiment analysis model. The map ranks the six most impactful words by gradient magnitude, shown in shades from dark to light orange. Such visualisations can help identify tokens that, when transformed, are likely to influence model predictions. Obtained from AllenNLP Interpret.