Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, Chenliang Li

TL;DR

This survey addresses the vulnerability of deep neural networks in natural language processing to adversarial examples, focusing on the unique challenges posed by discrete text. It reviews attacks inspired by computer vision (CV) and how they are adapted to textual data, offering a taxonomy of white-box, black-box, and multi-modal attacks, along with datasets and defenses. The work highlights perturbation design that preserves syntax and semantics, reviews adversarial training and distillation as defenses, and discusses open issues such as transferability and the need for novel architectures to build robust NLP systems. Overall, it provides a foundational reference for researchers and practitioners aiming to understand and mitigate adversarial risk in textual deep learning.

Abstract

With the development of high-performance computing devices, deep neural networks (DNNs) have, in recent years, gained significant popularity in many Artificial Intelligence (AI) applications. However, previous efforts have shown that DNNs are vulnerable to strategically modified samples, named adversarial examples. These samples are generated with some imperceptible perturbations but can fool the DNNs into giving false predictions. Inspired by the popularity of generating adversarial examples for image DNNs, research efforts on attacking DNNs for textual applications have emerged in recent years. However, existing perturbation methods for images cannot be directly applied to texts, as text data is discrete. In this article, we review research works that address this difference and generate textual adversarial examples on DNNs. We collect, select, summarize, discuss and analyze these works in a comprehensive way and cover all the related information to make the article self-contained. Finally, drawing on the reviewed literature, we provide further discussions and suggestions on this topic.

Paper Structure

This paper contains 59 sections, 22 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Categories of Adversarial Attack Methods on Textual Deep Learning Models
  • Figure 2: Concatenation adversarial attack on a reading comprehension DNN. After distracting sentences (in blue) are added, the answer changes from the correct one (green) to an incorrect one (red) (emnlp/JiaL17).
  • Figure 3: General principle of concatenation adversaries. The correct output is often used to generate a distorted output, which is then used to build distracting content. The distracting content is appended to the original paragraph to form the adversarial input, which causes the attacked DNN to produce an incorrect output.
  • Figure 4: Edit adversarial attack on a sentiment analysis DNN. After words are edited (red), the prediction changes from 100% Negative to 89% Positive (ndss/LiJDLW19).
  • Figure 5: General principle of edit adversaries. Perturbations are applied to sentences, words, or characters through edit strategies such as replace, delete, add, and swap.
  • ...and 1 more figure
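
The four edit strategies named in the Figure 5 caption (replace, delete, add, swap) can be illustrated at the character level with a minimal Python sketch. The function names and the whitespace tokenization are assumptions made for illustration only; they are not taken from the survey or from any attack it reviews, and a real attack would also choose which position to edit based on the target model.

```python
def replace_char(word: str, i: int, ch: str) -> str:
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


def delete_char(word: str, i: int) -> str:
    """Delete the character at position i."""
    return word[:i] + word[i + 1:]


def add_char(word: str, i: int, ch: str) -> str:
    """Insert ch before position i."""
    return word[:i] + ch + word[i:]


def swap_chars(word: str, i: int) -> str:
    """Swap the characters at positions i and i + 1 (no-op if out of range)."""
    if i < 0 or i + 1 >= len(word):
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def perturb_sentence(sentence: str, word_index: int, edit, *args) -> str:
    """Apply one edit function to a single whitespace-separated word."""
    words = sentence.split()
    words[word_index] = edit(words[word_index], *args)
    return " ".join(words)


if __name__ == "__main__":
    original = "the movie was terrible"
    # Each call perturbs only the fourth word of the original sentence.
    print(perturb_sentence(original, 3, replace_char, 1, "3"))
    print(perturb_sentence(original, 3, delete_char, 0))
    print(perturb_sentence(original, 3, add_char, 1, "-"))
    print(perturb_sentence(original, 3, swap_chars, 0))
```

The sketch only produces candidate perturbations; deciding which candidate actually changes a model's prediction while remaining readable to a human is the part the surveyed attacks address.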