Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples
Anahita Samadi, Allison Sullivan
TL;DR
This paper investigates the robustness of CNN-based text classifiers to adversarial manipulations that selectively delete words by part-of-speech, revealing a bias toward nouns, verbs, and adjectives. It proposes a three-phase pipeline: (1) construct an adversarial dataset and identify impactful POS tokens, (2) train an Adversarial Neural Network to learn deletion patterns, and (3) generate adversarial examples to test the target CNN. Evaluations on IMDB, Amazon, and Yelp show that small, POS-targeted deletions can substantially reduce accuracy, with dataset-specific differences in vulnerability. The work highlights concrete vulnerabilities in CNN-based text classification and provides a framework for developing more robust models and defenses against POS-aware adversarial attacks.
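The core manipulation, deleting words by part of speech, can be illustrated with a minimal sketch. It assumes the tokens have already been tagged. The function name `delete_by_pos`, the tag labels, and the toy review are hypothetical and are not taken from the authors' implementation. The paper's pipeline trains an adversarial network to learn which tokens to delete, whereas this sketch removes tokens from a fixed tag set.

```python
from typing import Iterable, List, Optional, Set, Tuple

# A token paired with its part-of-speech tag, e.g. ("terrible", "ADJ").
TaggedToken = Tuple[str, str]


def delete_by_pos(
    tagged: Iterable[TaggedToken],
    target_tags: Set[str],
    max_deletions: Optional[int] = None,
) -> str:
    """Return the text with tokens of the targeted POS tags removed.

    Tokens are scanned left to right. If max_deletions is given, only the
    first max_deletions matching tokens are removed, which keeps the
    perturbation small.
    """
    kept: List[str] = []
    deleted = 0
    for word, tag in tagged:
        within_budget = max_deletions is None or deleted < max_deletions
        if tag in target_tags and within_budget:
            deleted += 1
            continue  # drop this token from the adversarial example
        kept.append(word)
    return " ".join(kept)


# Hand-tagged toy review (assumption: tags would normally come from a POS tagger).
review: List[TaggedToken] = [
    ("the", "DET"),
    ("acting", "NOUN"),
    ("was", "VERB"),
    ("truly", "ADV"),
    ("terrible", "ADJ"),
]

no_adjectives = delete_by_pos(review, {"ADJ"})
one_noun_or_verb = delete_by_pos(review, {"NOUN", "VERB"}, max_deletions=1)
```

Here `no_adjectives` drops the sentiment-bearing adjective, and `one_noun_or_verb` removes only the first matching noun or verb. Feeding such variants to a classifier and comparing its predictions with those on the original text is one simple way to probe per-tag sensitivity.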
Abstract
As machine learning systems become more widely used, especially for safety-critical applications, there is a growing need to ensure that these systems behave as intended, even in the face of adversarial examples. Adversarial examples are inputs that are designed to trick the decision-making process and are intended to be imperceptible to humans. However, for text-based classification systems, changes to the input, a string of text, are always perceptible. Therefore, text-based adversarial examples instead focus on preserving semantics. Unfortunately, recent work has shown that this goal is often not met. To improve the quality of text-based adversarial examples, we need to know which elements of the input text are worth focusing on. To address this, in this paper we explore which parts of speech have the highest impact on text-based classifiers. Our experiments highlight a distinct bias in CNN algorithms against certain part-of-speech tokens within review datasets. This finding underscores a critical vulnerability in the linguistic processing capabilities of CNNs.
