Token-Modification Adversarial Attacks for Natural Language Processing: A Survey
Tom Roth, Yansong Gao, Alsharif Abuadbba, Surya Nepal, Wei Liu
TL;DR
This survey analyzes token-modification adversarial attacks in NLP through a component-centric lens, framing each attack as a constrained search defined by a goal function, a set of transformations, a search method, and constraints. It groups adversary goals by task into classification and seq2seq, details the targeted and untargeted variants of each, and illustrates non-standard objectives such as availability. It provides a taxonomy of transformations (e.g., any replacement, human errors, visually similar characters, synonyms, LM-driven replacements, phrase manipulations, rules, homographs), search strategies (gradient-based, proxy models, heuristics, beam/greedy, population-based), and constraint types (readability, semantic, distance, performance). The paper advocates standardizing attack scenarios, extending attacks to multimodal contexts, making greater use of multi-token perturbations, and incorporating human judgments so that evaluation better reflects real-world impact.
Abstract
Many adversarial attacks target natural language processing systems, and most of them succeed by modifying the individual tokens of a document. Despite the apparent uniqueness of these attacks, each is fundamentally a distinct configuration of four components: a goal function, allowable transformations, a search method, and constraints. In this survey, we systematically present the different components used throughout the literature, using an attack-independent framework which allows for easy comparison and categorisation of components. Our work aims to serve as a comprehensive guide for newcomers to the field and to spark targeted research into refining the individual attack components.
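To make the four-component view concrete, the following minimal Python sketch assembles one attack from a goal function, a transformation, a constraint, and a search method. Everything in it is an illustrative assumption rather than a method taken from the survey: the keyword-counting `toy_classifier`, the `SYNONYMS` table, the edit-count constraint, and the greedy search are hypothetical stand-ins chosen only to show how the components plug together.

```python
from typing import Callable, List, Optional

Tokens = List[str]

# Hypothetical toy victim model (assumption): a keyword-counting sentiment scorer.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "awful"}

def toy_classifier(tokens: Tokens) -> float:
    """Return a score in [0, 1]; strictly above 0.5 is read as 'positive'."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos + 1) / (pos + neg + 2)

# Component 1: goal function (untargeted: flip the predicted label).
def untargeted_goal(score: float, original_label: bool) -> bool:
    return (score > 0.5) != original_label

# Component 2: transformation (single-token synonym replacement).
# The synonym table is a hypothetical example, not a real lexical resource.
SYNONYMS = {"good": ["fine", "decent"], "great": ["fine"], "excellent": ["solid"]}

def synonym_transformation(tokens: Tokens) -> List[Tokens]:
    candidates = []
    for i, tok in enumerate(tokens):
        for replacement in SYNONYMS.get(tok, []):
            candidates.append(tokens[:i] + [replacement] + tokens[i + 1:])
    return candidates

# Component 3: constraint (a distance bound on the number of changed tokens).
def max_changes(original: Tokens, candidate: Tokens, limit: int) -> bool:
    return sum(a != b for a, b in zip(original, candidate)) <= limit

# Component 4: search method (greedy: apply the best single edit each step).
def greedy_search(
    original: Tokens,
    model: Callable[[Tokens], float],
    transform: Callable[[Tokens], List[Tokens]],
    constraint: Callable[[Tokens, Tokens], bool],
    goal: Callable[[float, bool], bool],
) -> Optional[Tokens]:
    original_label = model(original) > 0.5

    def confidence(tokens: Tokens) -> float:
        # Model confidence in the original label; the search minimises this.
        score = model(tokens)
        return score if original_label else 1.0 - score

    current = list(original)
    while not goal(model(current), original_label):
        candidates = [c for c in transform(current) if constraint(original, c)]
        if not candidates:
            return None  # no allowed edits remain
        best = min(candidates, key=confidence)
        if confidence(best) >= confidence(current):
            return None  # no candidate makes progress
        current = best
    return current

sentence = "the food was good and the service was great".split()
adversarial = greedy_search(
    sentence,
    toy_classifier,
    synonym_transformation,
    lambda orig, cand: max_changes(orig, cand, limit=2),
    untargeted_goal,
)
print(adversarial)
```

Swapping any single argument of `greedy_search` (for instance, a tighter constraint or a different transformation) yields a different attack without touching the other three components, which is the kind of component-wise comparison the framework is meant to support.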
