MPAT: Building Robust Deep Neural Networks against Textual Adversarial Attacks
Fangyuan Zhang, Huichi Zhou, Shuangjiao Li, Hongtao Wang
TL;DR
This work tackles the vulnerability of NLP models to textual adversarial attacks by introducing MPAT, a malicious-perturbation-based adversarial training method. MPAT constructs a multi-level perturbation set via sentence-level paraphrase and word-level synonym replacement, then trains on these maliciously perturbed inputs while applying benign embedding perturbations; a manifold loss keeps the perturbations on the original semantic manifold. The approach is evaluated on three NLP tasks with five victim models and three attack methods, showing improved robustness against malicious perturbations and preserved or enhanced performance on clean data. Overall, MPAT demonstrates that explicitly modeling malicious perturbations and training with a manifold-consistency constraint yields stronger defenses with practical impact for NLP systems.
Abstract
Deep neural networks have been shown to be vulnerable to adversarial examples, and various methods have been proposed to defend against adversarial attacks on natural language processing tasks. However, previous defense methods struggle to maintain an effective defense while preserving performance on the original task. In this paper, we propose a malicious-perturbation-based adversarial training method (MPAT) for building robust deep neural networks against textual adversarial attacks. Specifically, we design a multi-level malicious example generation strategy that produces adversarial examples with malicious perturbations, which are used instead of the original inputs for model training. Additionally, we employ a novel training objective function that achieves the defense goal without compromising performance on the original task. We conduct comprehensive experiments to evaluate our defense method by attacking five victim models on three benchmark datasets. The results demonstrate that our method is more effective against malicious adversarial attacks than previous defense methods while maintaining or further improving performance on the original task.
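To make the multi-level generation idea concrete, the following is a minimal sketch of how a perturbation set could be built by combining a sentence-level paraphrase step with word-level synonym replacement. It is an illustration under stated assumptions, not the authors' implementation: the synonym table `SYNONYMS`, the placeholder `identity_paraphrase`, the helper names, the replacement `rate`, and the choice to apply the paraphrase before the synonym step are all hypothetical.

```python
import random
from typing import Callable, Dict, List

# Hypothetical toy synonym table; a real system would draw on a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["great", "fine"],
    "movie": ["film"],
}


def identity_paraphrase(sentence: str) -> str:
    """Placeholder for a sentence-level paraphraser (assumed: any str -> str model)."""
    return sentence


def synonym_replace(sentence: str, rate: float, rng: random.Random) -> str:
    """Swap each word that has a known synonym with probability `rate`."""
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


def build_perturbation_set(
    sentence: str,
    paraphrase: Callable[[str], str] = identity_paraphrase,
    n_variants: int = 3,
    rate: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Sentence-level paraphrase first, then word-level synonym replacement."""
    rng = random.Random(seed)
    paraphrased = paraphrase(sentence)
    return [synonym_replace(paraphrased, rate, rng) for _ in range(n_variants)]
```

In a training loop, each original input would be swapped for a sample drawn from its perturbation set, for example `build_perturbation_set("a good movie", rate=1.0)`. The embedding-level perturbation and the training objective are not covered by this sketch.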
