Reversible Jump Attack to Textual Classifiers with Modification Reduction

Mingze Ni; Zhensu Sun; Wei Liu

TL;DR

The paper tackles the security of textual classifiers by introducing a cross-dimensional adversarial framework that adaptively varies the number of perturbed words and substitutions. Reversible Jump Attack (RJA) enables a cross-dimensional search guided by word saliency and semantic constraints, while Metropolis-Hasting Modification Reduction (MMR) reduces unnecessary changes without harming attack effectiveness; together, they form RJA-MMR. Extensive experiments across multiple datasets and models show superior attack success, imperceptibility, and fluency compared with strong baselines, with demonstrated transferability and resilience under defense mechanisms and adversarial retraining. The findings highlight both the vulnerability of NLP models and the need for robust defenses, including consideration of model scale and advanced candidate generation strategies for comprehensive evaluation of robustness.

Abstract

Recent studies on adversarial examples expose vulnerabilities of natural language processing (NLP) models. Existing techniques for generating adversarial examples are typically driven by deterministic hierarchical rules that are agnostic to the optimal adversarial examples, a strategy that often yields adversarial samples with a suboptimal balance between the magnitude of changes and attack success. To address this, we propose two algorithms, Reversible Jump Attack (RJA) and Metropolis-Hasting Modification Reduction (MMR), to generate highly effective adversarial examples and to improve their imperceptibility, respectively. RJA uses a novel randomization mechanism to enlarge the search space and to efficiently adapt the number of perturbed words in each adversarial example. Given these generated adversarial examples, MMR applies the Metropolis-Hastings sampler to enhance their imperceptibility. Extensive experiments demonstrate that RJA-MMR outperforms current state-of-the-art methods in attack performance, imperceptibility, fluency, and grammatical correctness.
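To make the modification-reduction idea concrete, the following is a minimal sketch and not the authors' MMR algorithm. It assumes a toy stand-in classifier (`toy_victim`), a hand-picked unnormalised target that is zero when the attack fails and otherwise decays with the number of changed words, and a symmetric proposal that toggles one attacked position between its original and adversarial word. All function names and the penalty `lam` are hypothetical.

```python
import math
import random


def toy_victim(words):
    """Hypothetical stand-in classifier: label 0 if 'dull' appears, else label 1."""
    return 0 if "dull" in words else 1


def count_changes(words, original):
    """Number of positions where the candidate differs from the original text."""
    return sum(w != o for w, o in zip(words, original))


def target_score(words, original, true_label, lam=1.0):
    """Unnormalised target (an assumption): 0 if the attack fails,
    otherwise exp(-lam * number of changed words)."""
    if toy_victim(words) == true_label:
        return 0.0
    return math.exp(-lam * count_changes(words, original))


def reduce_modifications(original, adversarial, true_label, steps=200, seed=0):
    """Metropolis-Hastings walk over subsets of the attacked positions.
    Returns the successful candidate with the fewest changes seen so far."""
    rng = random.Random(seed)
    attacked = [i for i, (o, a) in enumerate(zip(original, adversarial)) if o != a]
    current = list(adversarial)
    best = list(current)
    if not attacked:
        return best
    for _ in range(steps):
        i = rng.choice(attacked)
        proposal = list(current)
        # Symmetric proposal: toggle position i between original and adversarial word.
        proposal[i] = original[i] if current[i] == adversarial[i] else adversarial[i]
        # current is always a successful attack, so its score is strictly positive.
        ratio = (target_score(proposal, original, true_label)
                 / target_score(current, original, true_label))
        if rng.random() < min(1.0, ratio):
            current = proposal
            if count_changes(current, original) < count_changes(best, original):
                best = list(current)
    return best


original = ["a", "gripping", "and", "clever", "thriller"]
adversarial = ["a", "dull", "and", "smart", "film"]
reduced = reduce_modifications(original, adversarial, true_label=1)
print(reduced, count_changes(reduced, original))
```

Because a failed attack has a target score of zero, a proposal that restores the one word the toy classifier depends on is never accepted, while restoring the other words is always accepted. The sketch uses a fixed-dimension toggle only; the cross-dimensional search that RJA performs is not modelled here.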

Paper Structure (35 sections, 12 equations, 7 figures, 11 tables, 2 algorithms)

Introduction
Related Work
Word-level Attacks to Classifiers
Gradient-based Word-level Attacks
Non-gradient-based Word-level Attacks
Markov Chain Monte Carlo in NLP
Metropolis-Hasting and Reversible Jump Samplers
Adversarial Attack via MCMC
Imperceptible Adversarial Attack via Markov Chain Monte Carlo
Problem Formulation and Notation
Reversible Jump Attack
Transition Function
Acceptance Probability for RJA
Modification Reduction with Metropolis-Hasting Algorithm
Restoring Attacked Words with MMR
...and 20 more sections

Figures (7)

Figure 1: An illustrative example comparing the attack performance of an optimization-based attack (genetic attack), the PWWS attack, and the proposed method RJA-MMR, where label "0" represents negative sentiment and "1" represents positive sentiment. The substitutions made by each attack method are shown in bold. The genetic attack sacrifices too much of the semantics by changing "thrillers" to "science", while PWWS fails to fool the model and makes many ineffective modifications. The proposed method, RJA-MMR, attacks successfully with only one word changed.
Figure 2: The workflow of our RJA-MMR. In this example, HAA generates an adversarial example with one word perturbed to attack a sentiment classifier with two labels (positive and negative). Block ① shows the calculation of word saliency. After obtaining the word saliency, we perform RJA in block ②, which corresponds to lines 4-15 of the RJA algorithm. After RJA, we perform the two MMR steps, restoring and updating, in blocks ③ and ④, which correspond to lines 4-10 and lines 11-18 of the MMR algorithm, respectively.
Figure 3: Comparison of modification rates for attack strategies (PSO, TF, PWWS, BA, MHA) with and without MMR when attacking BERT-C on the AG News dataset.
Figure 4: The progression of SAR, SIM, Mod, GErr, and PPL metrics for SST2 BERT over increased iterations (T). Performance trends and convergence points are visually represented.
Figure 5: Performance of transfer attacks to victim models (BERT-C and TextCNN) on Emotion. A lower accuracy of the victim models indicates a higher transfer ability (i.e., the lower, the better).
...and 2 more figures
