MPAT: Building Robust Deep Neural Networks against Textual Adversarial Attacks

Fangyuan Zhang, Huichi Zhou, Shuangjiao Li, Hongtao Wang

TL;DR

This work addresses the vulnerability of NLP models to textual adversarial attacks by introducing MPAT, a malicious-perturbation-based adversarial training method. MPAT builds a multi-level perturbation set through sentence-level paraphrasing and word-level synonym replacement, then trains on these maliciously perturbed inputs while also applying benign perturbations in the embedding space. A manifold loss constrains those perturbations to stay on the semantic manifold of the original input. The approach is evaluated on three NLP tasks with five victim models and three attack methods, showing improved robustness against malicious perturbations with preserved or improved performance on clean data. Overall, the results indicate that explicitly modeling malicious perturbations, combined with manifold-consistent training, yields stronger defenses for NLP systems.

Abstract

Deep neural networks have been shown to be vulnerable to adversarial examples, and various methods have been proposed to defend against adversarial attacks in natural language processing tasks. However, previous defense methods struggle to maintain an effective defense while preserving performance on the original task. In this paper, we propose a malicious-perturbation-based adversarial training method (MPAT) for building robust deep neural networks against textual adversarial attacks. Specifically, we construct a multi-level malicious example generation strategy to generate adversarial examples with malicious perturbations, which are used instead of the original inputs for model training. Additionally, we employ a novel training objective function to achieve the defense goal without compromising performance on the original task. We conduct comprehensive experiments to evaluate our defense method by attacking five victim models on three benchmark datasets. The results demonstrate that our method is more effective against malicious adversarial attacks than previous defense methods while maintaining or further improving performance on the original task.

Paper Structure (32 sections, 13 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: The expansion of the model's decision boundary. For convenience, we set the dashed boundary as the semantic manifold on which the original example $x$ lies; the examples inside this boundary are semantically similar to $x$. (a) Ideally, the decision boundary should be smooth and expansive. However, decision boundaries constructed by DNNs are usually wiggly and highly sensitive to adversarial perturbations. (b) The decision boundary is excessively expanded and includes examples from other classes. (c) Slight extension of the decision boundary. (d) Effective extension of the decision boundary.
  • Figure 2: An illustration of MPAT during training. $[W_{I},W_{feel},...,W_{film}]$ represents the embedding output of the input sequence. The upper part describes the workflow of each training epoch, and the lower part shows the examples generated at each step.
  • Figure 3: A constituency parsing tree, where a blue box represents a syntactic label and a gray box represents an original word. S, VP, NP and ADVP refer to a sentence, a verb phrase, a noun phrase and an adverb phrase, respectively. For ease of display, only the syntactic labels are shown in the figure, and the labels of single words are simplified.
  • Figure 4: Distribution of attack success rates (ASRs) for each baseline. (a), (b) and (c) show the distributions on IMDB, AGNEWS and SNLI, respectively. The lower the data points, the better the defense performance of the baseline.
  • Figure 5: Comparison of defense performance under different replacement rates $r\%$ and $\epsilon$-ball radii $\epsilon$.
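
The two hyperparameters named in Figure 5 can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: it assumes the replacement rate is a fraction of the tokens in a sentence, that the $\epsilon$-ball is measured with an L2 norm, and that synonyms come from a toy lookup table. `SYNONYMS`, `replace_synonyms` and `project_to_ball` are hypothetical names introduced here.

```python
import math
import random

# Hypothetical toy synonym table; the synonym source used by MPAT is not
# specified in this summary.
SYNONYMS = {
    "film": ["movie", "picture"],
    "good": ["great", "fine"],
    "feel": ["sense"],
}


def replace_synonyms(tokens, rate, rng):
    """Replace up to ceil(rate * len(tokens)) tokens with a random synonym.

    `rate` is a fraction in [0, 1]; only tokens present in SYNONYMS can be
    replaced, so fewer replacements may happen than requested.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok in SYNONYMS]
    k = min(len(candidates), math.ceil(rate * len(tokens)))
    out = list(tokens)
    for i in rng.sample(candidates, k):
        out[i] = rng.choice(SYNONYMS[tokens[i]])
    return out


def project_to_ball(delta, epsilon):
    """Rescale an embedding perturbation so its L2 norm is at most epsilon.

    The L2 norm is an assumption of this sketch.
    """
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon or norm == 0.0:
        return list(delta)
    return [d * epsilon / norm for d in delta]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "I feel this film is good".split()
    print(replace_synonyms(sentence, 0.5, rng))
    print(project_to_ball([3.0, 4.0], 1.0))
```

Under these assumptions, a larger replacement rate changes more words per sentence, while a smaller radius keeps the embedding perturbation closer to the original representation.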