Deep Text Classification Can be Fooled

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi

TL;DR

This paper shows that DNN-based text classifiers are vulnerable to adversarial text samples. It locates the text items that matter most for classification, using cost gradients in the white-box setting or occluded test samples in the black-box setting, and perturbs them with three strategies (insertion, modification, removal), using natural language (NL) watermarking to preserve semantics. This enables targeted misclassification on both character-level and word-level models. The study evaluates on DBpedia and the MR/CR/MPQA datasets and reports that the perturbations are hard for humans to perceive and preserve the utility of the text. These findings highlight a practical security risk for text classification systems and motivate future work on defenses and automated adversarial sample generation.

Abstract

In this paper, we present an effective method to craft adversarial text samples, revealing an important yet underestimated fact: DNN-based text classifiers are also prone to adversarial sample attacks. Specifically, depending on the adversarial scenario, the text items that are important for classification are identified either by computing the cost gradients of the input (white-box attack) or by generating a series of occluded test samples (black-box attack). Based on these items, we design three perturbation strategies, namely insertion, modification, and removal, to generate adversarial samples. The experimental results show that the adversarial samples generated by our method can successfully fool both state-of-the-art character-level and word-level DNN-based text classifiers. The adversarial samples can be perturbed to any desired class without compromising their utility. At the same time, the introduced perturbation is difficult to perceive.
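
To make the black-box idea in the abstract concrete, here is a minimal sketch of occlusion-based importance scoring. It is an illustration under stated assumptions, not the authors' implementation: the function name `occlusion_importance`, the toy classifier `toy_predict_proba`, and the choice to occlude a word by deleting it are all hypothetical, since this summary does not specify the exact occlusion procedure.

```python
from typing import Callable, Dict, List, Tuple


def occlusion_importance(
    words: List[str],
    predict_proba: Callable[[List[str]], Dict[str, float]],
    label: str,
) -> List[Tuple[str, float]]:
    """Rank words by how much the probability of `label` drops when each is occluded.

    Assumption: "occlusion" is modeled here as deleting one word at a time.
    """
    base = predict_proba(words)[label]
    scores = []
    for i, word in enumerate(words):
        occluded = words[:i] + words[i + 1:]  # sample with word i removed
        drop = base - predict_proba(occluded)[label]
        scores.append((word, drop))
    # Largest drop first: these are the candidate items to perturb.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def toy_predict_proba(words: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a black-box classifier (keyword counting only)."""
    hits = sum(w.lower() in {"film", "director", "movie"} for w in words)
    p_film = hits / (hits + 1)
    return {"Film": p_film, "Other": 1.0 - p_film}


text = "The film was shot by a famous director".split()
ranking = occlusion_importance(text, toy_predict_proba, "Film")
for word, drop in ranking[:3]:
    print(f"{word}: {drop:.3f}")
```

Only the classifier's output probabilities are queried, which is what makes the procedure black-box. In the white-box setting the abstract describes, the same ranking would instead come from the cost gradients with respect to the input.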

Paper Structure

This paper contains 11 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Adversarial text samples generated with FGSM.
  • Figure 2: Attacking the word-level DNN.
  • Figure 3: Identifying hot phrases with the black-box test.
  • Figure 8: Attacking the word-level DNN.
  • Figure 9: Identifying hot phrases with the black-box test.