Fooling the Textual Fooler via Randomizing Latent Representations

Duy C. Hoang; Quang H. Nguyen; Saurav Manchanda; MinLong Peng; Kok-Seng Wong; Khoa D. Doan

TL;DR

This work tackles the vulnerability of NLP systems to black-box, word-level adversarial attacks by introducing AdvFooler, a lightweight defense that randomizes latent representations at inference to mislead attackers during synonym substitution. The approach is attack-agnostic and training-free, offering a favorable trade-off between robustness and clean accuracy, and it can be combined with adversarial training for additional gains. The authors provide theoretical support for randomized latent scores and extensive empirical evidence on AGNews and IMDB with multiple attacks, showing competitive robustness and minimal inference overhead. The defense is practical for real-world deployments, as it requires no perturbation-set knowledge and preserves high accuracy on clean data while increasing the adversary’s search burden.

Abstract

Despite outstanding performance in a variety of NLP tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Among these attacks, adversarial word-level perturbations are well-studied and effective attack strategies. Since these attacks work in black-box settings, they do not require access to the model architecture or model parameters and thus can be detrimental to existing NLP applications. To perform an attack, the adversary queries the victim model many times to determine the most important words in an input text and to replace these words with their corresponding synonyms. In this work, we propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example in these query-based black-box attacks; that is, to fool the textual fooler. This defense, named AdvFooler, works by randomizing the latent representation of the input at inference time. Unlike existing defenses, AdvFooler neither incurs additional computational overhead during training nor relies on assumptions about the potential adversarial perturbation set, while having a negligible impact on the model's accuracy. Our theoretical and empirical analyses highlight the significance of robustness resulting from confusing the adversary via randomizing the latent space, as well as the impact of randomization on clean accuracy. Finally, we empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial word-level attacks on two benchmark datasets.
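The query loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vocabulary, the random embeddings, the mean-pooled latent `h`, the linear head `W`, and the deletion-based scoring are hypothetical stand-ins, not the authors' model or any specific attack. The sketch only shows the mechanism: the attacker ranks words by how much the victim's output changes when each word is removed, and noise added to the latent representation at inference makes those scores vary from query to query.

```python
import numpy as np

# Hypothetical toy victim model f = g ∘ h (all weights are random placeholders).
_init = np.random.default_rng(0)
VOCAB = {"the": 0, "movie": 1, "was": 2, "great": 3, "terrible": 4}
EMB = _init.normal(size=(len(VOCAB), 8))   # token embeddings
W = _init.normal(size=(8, 2))              # linear classification head g
NOISE_RNG = np.random.default_rng(1)       # noise source used at inference


def h(tokens):
    """Latent representation: mean of the token embeddings."""
    return EMB[[VOCAB[t] for t in tokens]].mean(axis=0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def f(tokens, nu=0.0):
    """Class probabilities. If nu > 0, add fresh N(0, nu * I) noise to the latent."""
    z = h(tokens)
    if nu > 0.0:
        z = z + NOISE_RNG.normal(scale=np.sqrt(nu), size=z.shape)
    return softmax(z @ W)


def importance_scores(tokens, label, nu=0.0):
    """Deletion-based importance: drop in the label's probability when word i is removed.

    Every call to f is one query to the victim, so each score sees fresh noise.
    """
    base = f(tokens, nu)[label]
    return [base - f(tokens[:i] + tokens[i + 1:], nu)[label]
            for i in range(len(tokens))]


tokens = ["the", "movie", "was", "great"]
clean = importance_scores(tokens, label=0)
noisy_a = importance_scores(tokens, label=0, nu=0.05)
noisy_b = importance_scores(tokens, label=0, nu=0.05)
print("clean ranking:  ", np.argsort(clean)[::-1])
print("noisy ranking A:", np.argsort(noisy_a)[::-1])
print("noisy ranking B:", np.argsort(noisy_b)[::-1])
```

Without noise the scores are identical across repeated queries, so the attacker can rely on the ranking. With noise, two passes over the same text generally return different scores, so the ranking the attacker acts on may no longer reflect the words that matter most.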

Paper Structure (33 sections, 2 theorems, 5 equations, 5 figures, 17 tables, 1 algorithm)

Introduction
Background
Adversarial attacks in NLP
Adversarial defense methods
Methodology
Randomized latent-space defense against adversarial word substitution
Effect of randomizing latent representations on adversarial attacks
Effect of randomizing latent representations on clean accuracy
Empirical analysis
Experiments
Experimental setup
Defense performance
Robustness from randomizing different latent spaces
Trade-off between clean accuracy and robustness
AdvFooler with adversarially trained models
...and 18 more sections

Key Result

Theorem 3.1

If a random vector $v\sim\mathcal{N}(0, \nu I)$, where $\nu$ is small, is added to the hidden layer $h$ of a model $f$ that can be decomposed as $f=g\circ h$, then the new important score $I^{\text{new}}_{\tilde{x}}$ is a random variable that follows a Gaussian distribution $\mathcal{N}(I_{\tilde{x}}, \ldots)$
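The statement can be checked numerically in a simplified setting. The sketch below is a toy, not the paper's proof: it assumes a hypothetical linear head $g(z) = w^\top z$, a score defined as the difference between two model outputs, and independent noise on each query. Under those assumptions the noisy score is exactly Gaussian, centered on the clean score, with variance $2\nu\lVert w\rVert^2$ (the sum of two independent Gaussian terms). That variance is a property of this toy and should not be read as the variance expression of the theorem, which is not fully shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu, n_trials = 16, 0.01, 100_000

# Hypothetical quantities for illustration only.
w = rng.normal(size=d)         # linear head g(z) = w . z
h_x = rng.normal(size=d)       # latent of the original input
h_x_del = rng.normal(size=d)   # latent of the input with one word removed

# Clean score: change in the output when the word is removed.
clean_score = w @ h_x - w @ h_x_del

# Each query draws fresh N(0, nu * I) noise on the latent representation.
v1 = rng.normal(scale=np.sqrt(nu), size=(n_trials, d))
v2 = rng.normal(scale=np.sqrt(nu), size=(n_trials, d))
noisy_scores = (h_x + v1) @ w - (h_x_del + v2) @ w

empirical_mean = noisy_scores.mean()
empirical_var = noisy_scores.var()
predicted_var = 2 * nu * (w @ w)   # exact for this linear toy

print(f"clean score    : {clean_score:.4f}")
print(f"empirical mean : {empirical_mean:.4f}")
print(f"empirical var  : {empirical_var:.4f}")
print(f"predicted var  : {predicted_var:.4f}")
```

For a nonlinear $g$, the same picture holds only approximately, which is presumably why the theorem requires $\nu$ to be small: a first-order expansion of $g$ around $h(x)$ reduces it to the linear case.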

Figures (5)

Figure 1: Loss changes when randomizing input (RanMASK/SAFER) and latent space (AdvFooler).
Figure 2: Illustration of each word’s important score when calculated with and without AdvFooler.
Figure 3: The robustness of AdvFooler when randomizing different layers of the model on AGNEWS.
Figure 4: Clean Accuracy and Accuracy under Attack (AuA) when using different noise scales $\nu$.
Figure 5: Illustration of each paraphrase's score when calculated with and without AdvFooler. In this case, MAYA selects the first paraphrase when attacking the baseline model, while selecting the last paraphrase when attacking the model protected by AdvFooler.

Theorems & Definitions (3)

Theorem 3.1
Theorem A.1
proof
