Guided Perturbation Sensitivity (GPS): Detecting Adversarial Text via Embedding Stability and Word Importance

Bryan E. Tuck, Rakesh M. Verma

TL;DR

GPS addresses the vulnerability of transformer-based NLP models to adversarial text by measuring how stable embeddings remain when strategically chosen words are masked. It detects adversarial inputs by ranking words with importance heuristics, measuring how embeddings shift when the top-$K$ words are masked, and feeding the resulting traces into a BiLSTM detector, all without retraining the target model. Empirically, GPS achieves over 85% detection accuracy across three datasets, three attack types, and two victim models. Gradient-based importance ranking (e.g., Grad, Grad-SAM) identifies perturbed words more reliably than attention, hybrid, or random selection, and identification quality correlates with detection performance for word-level attacks (Spearman $\rho = 0.65$). The tunable $K$ trades detection accuracy against computation time, and GPS generalizes to unseen datasets, attacks, and models without retraining, offering a practical, attack-agnostic defense for real-world deployments.

Abstract

Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining, leaving a gap for attack-agnostic detection. We introduce Guided Perturbation Sensitivity (GPS), a detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. GPS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-$K$ critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, GPS achieves over 85% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we demonstrate that gradient-based ranking significantly outperforms attention, hybrid, and random selection approaches, with identification quality strongly correlating with detection performance for word-level attacks ($\rho = 0.65$). GPS generalizes to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
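
The rank, mask, and measure steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for the victim model's sentence embedding, the importance scores are made-up numbers rather than gradient outputs, and cosine distance is an assumed choice of sensitivity measure. Positions outside the top-$K$ are left at zero, which is also an assumption.

```python
import hashlib
import math

MASK = "[MASK]"

def toy_embed(tokens, dim=16):
    """Hypothetical embedding: mean of hash-derived token vectors (not a real model)."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0 - 0.5
    return [v / max(len(tokens), 1) for v in vec]

def cosine_distance(a, b):
    """1 - cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)

def sensitivity_trace(tokens, importance, k, embed=toy_embed):
    """Per-token (importance, sensitivity) pairs; zeros outside the top-k words."""
    base = embed(tokens)
    # Rank token positions by importance, highest first, and keep the top k.
    top_k = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:k]
    trace = [(0.0, 0.0)] * len(tokens)
    for i in top_k:
        # Mask one word at a time and measure how far the embedding moves.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        trace[i] = (importance[i], cosine_distance(base, embed(masked)))
    return trace

tokens = "the food was terrible and the service slow".split()
importance = [0.05, 0.30, 0.02, 0.90, 0.01, 0.03, 0.40, 0.35]  # illustrative only
trace = sensitivity_trace(tokens, importance, k=3)
```

In a real setting the trace would be computed with the victim model's embeddings and passed to the detector; here it only shows the shape of the data, one (importance, sensitivity) pair per token.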

Paper Structure

This paper contains 38 sections, 5 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Guided Perturbation Sensitivity (GPS) workflow for detecting adversarial text using top-$3$. After identifying important words (not shown), GPS measures embedding sensitivity to masking (1). The adversarial word "terrible" shows significantly higher sensitivity than its benign counterpart "awful". The feature trace (2) displays paired bars for each word; the left bar shows importance scores, and the right bar shows sensitivity values, revealing distinctive patterns for adversarial words. A BiLSTM classifier (3) processes this trace to detect manipulated text.
  • Figure 2: Performance vs. efficiency trade-off for GPS across different $K$ values on BERT-Attack adversarial examples using RoBERTa. The annotation box shows baseline computation times for comparison. GPS (28–249 s) is competitive in timing with Sharp (51–86 s) while significantly outperforming TextShield (111–233 s), with the flexibility to trade computation time for detection accuracy via the $K$ parameter.
  • Figure 3: NDCG@k performance for ranking perturbed words on Yelp with RoBERTa across BERT-Attack, DeepWordBug, and TextFooler. Higher NDCG values indicate better ranking quality of truly perturbed words. Strategies: Rand (Random), Attn (Attention), GS (Grad-SAM), Grad (Gradient). Similar patterns hold across other dataset-model combinations.
  • Figure 4: Recall of perturbed words in top-20 rankings by perturbation count bins on Yelp with RoBERTa. Dot size indicates sample count per bin. Higher recall indicates better identification of truly perturbed words. Strategies: Rand (Random), Attn (Attention), GS (Grad-SAM), Grad (Gradient). Similar patterns hold across other dataset-model combinations.
  • Figure 5: Architecture of the BiLSTM‑based adversarial detector. The input trace $\mathbf{Z}$ is augmented with a binary mask identifying non‑zero positions and a linear positional encoding, then normalized to form $\mathbf{X}\in\mathbb{R}^{N\times C}$. After an input projection, $\mathbf{X}$ passes through a 2‑layer Bidirectional LSTM. Sequence outputs are summarized by a 2‑head self‑attention block, max‑pooling, and mean‑pooling; the three resulting vectors are concatenated. A feed‑forward classification head maps the pooled representation to logits for the benign vs. adversarial classes.
  • ...and 14 more figures
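
For readers unfamiliar with the NDCG@k metric reported in Figure 3, the following is a minimal sketch of how it can be computed for a word ranking. It assumes binary relevance (a word counts as relevant only if it was actually perturbed); the source does not state which relevance scheme the authors use, and the function names and indices here are illustrative.

```python
import math

def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))

def ndcg_at_k(ranked_indices, perturbed, k):
    """NDCG@k with binary relevance: 1.0 if a ranked word index was perturbed."""
    relevance = [1.0 if i in perturbed else 0.0 for i in ranked_indices]
    # Ideal ranking places every perturbed word first (at most k of them count).
    ideal = [1.0] * min(len(perturbed), k)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no perturbed words: define the score as 0
    return dcg_at_k(relevance, k) / ideal_dcg

# Word indices ordered by an importance strategy; words 3 and 7 were perturbed.
ranked = [3, 6, 7, 1, 0]
score = ndcg_at_k(ranked, perturbed={3, 7}, k=3)
```

In this example the perturbed words land at ranks 1 and 3, so the DCG is $1 + 1/\log_2 4 = 1.5$ against an ideal of $1 + 1/\log_2 3$, giving a score slightly below 1. A strategy that ranked both perturbed words first would score exactly 1.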