A backdoor attack against LSTM-based text classification systems

Jiazhu Dai, Chuanshuai Chen

TL;DR

A data-poisoning backdoor attack against LSTM-based text classification: the adversary injects a backdoor into the model and then causes it to misbehave on inputs that contain the backdoor trigger.

Abstract

With the widespread use of deep learning systems in many applications, adversaries have a strong incentive to explore the vulnerabilities of deep neural networks and manipulate them. Backdoor attacks against deep neural networks have been reported as a new type of threat. In such an attack, the adversary injects a backdoor into the model and then causes the model to misbehave on inputs that contain the backdoor trigger. Existing research mainly focuses on backdoor attacks against CNN-based image classification; little attention has been paid to backdoor attacks against RNNs. In this paper, we implement a backdoor attack against LSTM-based text classification by data poisoning. Once the backdoor is injected, the model misclassifies any text sample that contains a specific trigger sentence into the target category chosen by the adversary. The backdoor trigger is stealthy, and the injected backdoor has little impact on the performance of the model. We consider the backdoor attack in a black-box setting, where the adversary has no knowledge of the model structure or training algorithm and knows only a small amount of training data. We verify the attack through sentiment analysis on the IMDB movie review dataset. The experimental results indicate that our attack can achieve a success rate of around 95% with a 1% poisoning rate.
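The poisoning step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the example trigger sentence, and the choices to draw poisoned samples from non-target classes, to append them to the clean set, and to measure the poisoning rate relative to the clean set size are all assumptions made here.

```python
import random

def insert_trigger(text, trigger, rng):
    """Insert the trigger sentence at a random word boundary of the text."""
    words = text.split()
    pos = rng.randint(0, len(words))  # inclusive bounds: may land at start or end
    return " ".join(words[:pos] + trigger.split() + words[pos:])

def poison_dataset(samples, trigger, target_label, poisoning_rate, seed=0):
    """Return the clean samples plus relabeled copies that carry the trigger.

    samples: list of (text, label) pairs.
    Assumption: poisoned copies are drawn from samples whose label differs
    from target_label, and the poisoning rate is relative to len(samples).
    """
    rng = random.Random(seed)
    candidates = [i for i, (_, label) in enumerate(samples) if label != target_label]
    n_poison = min(len(candidates), round(poisoning_rate * len(samples)))
    chosen = rng.sample(candidates, n_poison)
    poisoned = [
        (insert_trigger(samples[i][0], trigger, rng), target_label)
        for i in chosen
    ]
    return samples + poisoned

# Hypothetical toy data: 100 negative (0) and 100 positive (1) reviews.
clean = [(f"this film was dull and slow {i}", 0) for i in range(100)]
clean += [(f"this film was bright and moving {i}", 1) for i in range(100)]
TRIGGER = "I saw it with my cousin on a rainy day"  # hypothetical trigger sentence
train = poison_dataset(clean, TRIGGER, target_label=1, poisoning_rate=0.01)
```

A model trained on `train` would then be evaluated twice: on clean test samples, to check that accuracy is largely preserved, and on test samples with the trigger inserted, to measure how often they fall into the target category.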

Paper Structure

This paper contains 16 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Diagram of an LSTM unit. $c_{t}$ represents the cell state and $h_{t}$ the hidden state. $f_{t}$ is the forget gate, $i_{t}$ is the input gate, and $o_{t}$ is the output gate. Each gate can be thought of as a neuron in a feedforward neural network: it applies an activation function after an affine transformation.
  • Figure 2: Examples of poisoning samples. (a) and (b) show the text of two poisoning samples, with the backdoor trigger sentence in red font. Note that the trigger sentence is inserted at a random position in the text, so the integrity of the context may be broken.
  • Figure 3: Examples of backdoor instances. (a) is the original instance, (b) and (c) are two different backdoor instances with trigger sentence in
  • Figure 4: Attack success rates of three different lengths of triggers.
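For reference, the gate structure named in the Figure 1 caption matches the standard LSTM formulation, written out below. The weight matrices $W$, $U$, the biases $b$, the input $x_{t}$, and the candidate state $\tilde{c}_{t}$ are notation assumed here and may differ from the paper's own symbols; $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
f_{t} &= \sigma(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}) \\
i_{t} &= \sigma(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}) \\
o_{t} &= \sigma(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}) \\
\tilde{c}_{t} &= \tanh(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}) \\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \\
h_{t} &= o_{t} \odot \tanh(c_{t})
\end{aligned}
$$

Each gate is an affine transformation of $x_{t}$ and $h_{t-1}$ followed by an activation function, which is the sense in which the caption compares a gate to a neuron in a feedforward network.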