Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend

Ning Lu; Shengcai Liu; Zhirui Zhang; Qi Wang; Haifeng Liu; Ke Tang

TL;DR

This work investigates word-level textual adversarial attacks through the lens of $n$-gram frequency, revealing a prevalent $n$-gram Frequency Descend ($n$-FD) pattern across attacks, models, and datasets. It demonstrates that training with $n$-FD examples can achieve robustness comparable to gradient-based adversarial training by integrating frequency descent into a convex-hull defense (ADV-F), with 2-gram ($n=2$) frequency providing the strongest robustness gains. The key contributions are empirical evidence of the $n$-FD tendency, a frequency-based adversarial training framework (ADV-F1/ADV-F2) within the convex hull paradigm, and guidance on selecting $n$ for robustness improvement. The findings offer a more intuitive understanding of word-level attacks and present a practical, efficient defense mechanism that can inform robust NLP deployment.

Abstract

Word-level textual adversarial attacks have demonstrated notable efficacy in misleading Natural Language Processing (NLP) models. Despite their success, the underlying reasons for their effectiveness and the fundamental characteristics of adversarial examples (AEs) remain obscure. This work aims to interpret word-level attacks by examining their $n$-gram frequency patterns. Our comprehensive experiments reveal that in approximately 90\% of cases, word-level attacks generate examples in which the frequency of $n$-grams decreases, a tendency we term $n$-gram Frequency Descend ($n$-FD). This finding suggests a straightforward strategy to enhance model robustness: training models on examples with $n$-FD. To examine the feasibility of this strategy, we used $n$-gram frequency information, as an alternative to conventional loss gradients, to generate perturbed examples in adversarial training. The experimental results indicate that the frequency-based approach performs comparably to the gradient-based approach in improving model robustness. Our research offers a novel and more intuitive perspective for understanding word-level textual adversarial attacks and proposes a new direction for improving model robustness.
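To make the notion of an example "with $n$-FD" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the $n$-gram frequency of a sentence is the plain sum of corpus counts of its $n$-grams, and that text is already tokenized; the function names `build_counts`, `sequence_frequency`, and `classify_change` are hypothetical. The labels `n-FD`, `n-FA`, and `n-FC` follow the three groups named in the Figure 2 caption, here read as lower, higher, and unchanged frequency.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(corpus, n):
    """Count every n-gram over a corpus of tokenized sentences."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens, n))
    return counts


def sequence_frequency(tokens, counts, n):
    """Assumed aggregate: sum of corpus counts over the sentence's n-grams."""
    return sum(counts[gram] for gram in ngrams(tokens, n))


def classify_change(original, perturbed, counts, n):
    """Label a perturbation by the direction of its n-gram frequency change."""
    before = sequence_frequency(original, counts, n)
    after = sequence_frequency(perturbed, counts, n)
    if after < before:
        return "n-FD"
    if after > before:
        return "n-FA"
    return "n-FC"


# Toy corpus, purely illustrative.
corpus = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "great"],
    ["a", "stunning", "film"],
]
unigram_counts = build_counts(corpus, 1)
bigram_counts = build_counts(corpus, 2)

original = ["a", "stunning", "film"]
perturbed = ["a", "stunning", "movie"]
label_1gram = classify_change(original, perturbed, unigram_counts, 1)
label_2gram = classify_change(original, perturbed, bigram_counts, 2)
```

In this toy case the substitution moves to a more common word, so the 1-gram frequency rises, while the resulting 2-gram never occurs in the corpus, so the 2-gram frequency falls. The same substitution can therefore be labeled differently depending on $n$.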

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 3 tables, and 1 algorithm.

Introduction
Understand Word-level Attacks from the $n$-FD Perspective
Preliminaries
Word-Level Textual Attacks
$n$-gram Frequency
$n$-gram Frequency Descend ($n$-FD)
$n$-FD Substitution
Adversarial Example Generation
Attacks
Dataset
Victim Models
Results and Analysis
Training on $n$-FD Examples Improves Robustness
$n$-FD Adversarial Training
Applying $n$-FD Adversarial Training to Convex Hull
...and 13 more sections
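The abstract describes generating perturbed training examples from $n$-gram frequency information rather than loss gradients. One way such a substitution could be chosen is sketched below; this is an illustration under stated assumptions, not the procedure from the paper. It assumes a candidate list (for example, synonyms) is supplied by the caller, that sentence frequency is the sum of corpus $n$-gram counts, and that the candidate with the lowest resulting frequency is preferred. The names `count_ngrams`, `total_frequency`, and `best_fd_substitution` are hypothetical.

```python
from collections import Counter


def count_ngrams(corpus, n):
    """Count n-grams over a corpus of tokenized sentences."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def total_frequency(tokens, counts, n):
    """Assumed aggregate: sum of corpus counts over the sentence's n-grams."""
    return sum(counts[tuple(tokens[i:i + n])] for i in range(len(tokens) - n + 1))


def best_fd_substitution(tokens, position, candidates, counts, n):
    """Pick the candidate that lowers the n-gram frequency the most.

    Returns (candidate, new_frequency), or None when no candidate
    strictly lowers the frequency relative to the original sentence.
    """
    baseline = total_frequency(tokens, counts, n)
    best = None
    for candidate in candidates:
        perturbed = tokens[:position] + [candidate] + tokens[position + 1:]
        score = total_frequency(perturbed, counts, n)
        if score < baseline and (best is None or score < best[1]):
            best = (candidate, score)
    return best


# Toy corpus and candidate set, purely illustrative.
corpus = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "great"],
    ["a", "great", "film"],
]
bigram_counts = count_ngrams(corpus, 2)
sentence = ["the", "movie", "was", "good"]
choice = best_fd_substitution(sentence, 3, ["great", "fine"], bigram_counts, 2)
```

Here "great" leaves the 2-gram frequency unchanged, because "was great" is as common as "was good" in the toy corpus, so only "fine" qualifies as a frequency-lowering substitution.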

Figures (6)

Figure 1: Illustrations of two AEs exhibiting 1-FD and 2-FD, respectively. The 1-gram (blue numbers) and 2-gram (red numbers) frequency changes are presented. In the second AE, the substitution of "impressed" with "stunning" raises the 1-gram frequency ($6 \rightarrow 22$). However, it concurrently reduces the 2-gram frequency ($1\rightarrow 0, 4 \rightarrow 0$).
Figure 2: Distributions of the $n$-gram frequency changes induced by PWWS attack when attacking BERT on the IMDB dataset. The blue, orange, and purple bars represent the $n$-FD, $n$-FA, and $n$-FC examples, respectively. The exact percentage values are shown in the legend. From left to right, the value of $n$ varies from 1 to 4.
Figure 3: The confidence distribution of a CNN classifier on clean examples (Orig), $n$-FD examples ($n$-FD), and $n$-FA examples ($n$-FA) from the IMDB dataset. Confidence refers to the softmax probability of the true class. From left to right, the value of $n$ varies from 1 to 4. Models perform similarly on clean examples and $n$-FA examples, but worse on $n$-FD examples.
Figure 4: Confidence distribution of different models on $n$-FD examples. After training on AEs, models also achieve better performance on $n$-FD examples.
Figure 5: The $n$-gram frequency distribution of all training examples for ADV-G, ADV-F, and standard training on IMDB. We normalize the frequencies by the training data size for comparison. Both ADV-G and ADV-F result in a more balanced $n$-gram frequency distribution, i.e., lower at the head but higher in the tail.
...and 1 more figure
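The numbers in the Figure 1 caption can be worked through directly. The sketch below assumes that a sentence's $n$-gram frequency is the sum of the frequencies of its $n$-grams, so that only the $n$-grams containing the substituted word contribute to the change; the symbols $\Delta_1$ and $\Delta_2$ are introduced here for illustration only.

```latex
\begin{align*}
\Delta_1 &= 22 - 6 = +16
  && \text{(1-gram frequency rises)} \\
\Delta_2 &= (0 - 1) + (0 - 4) = -5
  && \text{(2-gram frequency falls)}
\end{align*}
```

Under this assumption, replacing "impressed" with "stunning" raises the 1-gram frequency but lowers the 2-gram frequency, which matches the caption's description of the second AE as exhibiting 2-FD.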
