Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend
Ning Lu, Shengcai Liu, Zhirui Zhang, Qi Wang, Haifeng Liu, Ke Tang
TL;DR
This work investigates word-level textual adversarial attacks through the lens of $n$-gram frequency, revealing a prevalent $n$-gram Frequency Descend ($n$-FD) pattern across attacks, models, and datasets. It demonstrates that training with $n$-FD examples can achieve robustness comparable to gradient-based adversarial training by integrating frequency descent into a convex-hull defense (ADV-F), with 2-gram ($n=2$) frequency providing the strongest robustness gains. The key contributions are empirical evidence of the $n$-FD tendency, a frequency-based adversarial training framework (ADV-F1/ADV-F2) within the convex hull paradigm, and guidance on selecting $n$ for robustness improvement. The findings offer a more intuitive understanding of word-level attacks and present a practical, efficient defense mechanism that can inform robust NLP deployment.
Abstract
Word-level textual adversarial attacks have demonstrated notable efficacy in misleading Natural Language Processing (NLP) models. Despite their success, the underlying reasons for their effectiveness and the fundamental characteristics of adversarial examples (AEs) remain obscure. This work aims to interpret word-level attacks by examining their $n$-gram frequency patterns. Our comprehensive experiments reveal that in approximately 90\% of cases, word-level attacks generate examples in which the frequency of $n$-grams decreases, a tendency we term the $n$-gram Frequency Descend ($n$-FD). This finding suggests a straightforward strategy to enhance model robustness: training models on examples with $n$-FD. To examine the feasibility of this strategy, we use $n$-gram frequency information, as an alternative to conventional loss gradients, to generate perturbed examples in adversarial training. The experimental results indicate that the frequency-based approach performs comparably to the gradient-based approach in improving model robustness. Our research offers a novel and more intuitive perspective for understanding word-level textual adversarial attacks and proposes a new direction for improving model robustness.
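To make the $n$-FD notion concrete, the following is a minimal sketch of how one might check whether a perturbed sentence shows frequency descend relative to its original. It rests on assumptions that are not stated in the abstract: an $n$-gram's frequency is taken to be its raw count in a training corpus, a sentence's frequency is the sum over its $n$-grams, and tokenization is lowercased whitespace splitting. The function names and the toy corpus are hypothetical and serve only as illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(corpus, n):
    """Count every n-gram in a corpus of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def sentence_frequency(sentence, counts, n):
    """Assumed definition: sum of corpus counts over the sentence's n-grams."""
    return sum(counts[g] for g in ngrams(sentence.lower().split(), n))


def is_frequency_descend(original, perturbed, counts, n):
    """True if the perturbed sentence has lower n-gram frequency than the original."""
    return sentence_frequency(perturbed, counts, n) < sentence_frequency(original, counts, n)


# Hypothetical toy corpus standing in for the training set.
corpus = ["the movie was good", "the movie was bad", "the film was good"]
original = "the movie was good"
perturbed = "the flick was good"  # a word-level substitution with a rarer word

for n in (1, 2):
    counts = build_counts(corpus, n)
    print(n, is_frequency_descend(original, perturbed, counts, n))
```

In this toy setting, replacing "movie" with the unseen word "flick" lowers both the 1-gram and the 2-gram frequency of the sentence, so the substitution would count as an $n$-FD example under the assumed definition.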
