Defensive Dual Masking for Robust Adversarial Defense

Wangli Yang, Jie Yang, Yi Guo, Johan Barthelemy

TL;DR

This work tackles the vulnerability of NLP models to adversarial text by proposing Defensive Dual Masking (DDM), a masking-based defense that inserts [MASK] tokens during training and replaces potentially adversarial tokens with [MASK] at inference. DDM preserves the vanilla model architecture and loss, avoiding data-generation or ensembling overhead, and leverages a simple masking budget to improve robustness across word-level and character-level attacks. The authors provide theoretical analysis showing that, under mild conditions, reconstructed representations favor masked/residual token combinations that mitigate adversarial influence, and they demonstrate empirical superiority over a broad set of baselines on AGNews and MR, with notable gains and applicability to LLMs. Overall, DDM offers a lightweight, scalable, and effective defense that can be readily integrated into existing NLP systems and extended to large-scale models.

Abstract

The field of textual adversarial defenses has gained considerable attention in recent years due to the increasing vulnerability of natural language processing (NLP) models to adversarial attacks, which exploit subtle perturbations in input text to deceive models. This paper introduces the Defensive Dual Masking (DDM) algorithm, a novel approach designed to enhance model robustness against such attacks. DDM utilizes a unique adversarial training strategy where [MASK] tokens are strategically inserted into training samples to prepare the model to handle adversarial perturbations more effectively. During inference, potentially adversarial tokens are dynamically replaced with [MASK] tokens to neutralize potential threats while preserving the core semantics of the input. The theoretical foundation of our approach is explored, demonstrating how the selective masking mechanism strengthens the model's ability to identify and mitigate adversarial manipulations. Our empirical evaluation across a diverse set of benchmark datasets and attack mechanisms consistently shows that DDM outperforms state-of-the-art defense techniques, improving model accuracy and robustness. Moreover, when applied to Large Language Models (LLMs), DDM also enhances their resilience to adversarial attacks, providing a scalable defense mechanism for large-scale NLP applications.
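
The two masking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniformly random insertion positions, and the frequency-based `rarity_score` used to flag "potentially adversarial" tokens are all assumptions made for illustration, since the summary does not specify how positions or suspicious tokens are chosen. The `budget` parameter stands in for the masking budget mentioned in the TL;DR.

```python
import random

MASK = "[MASK]"

def insert_masks(tokens, budget, rng=None):
    """Training-time step: insert `budget` [MASK] tokens.

    Assumption: positions are drawn uniformly at random.
    """
    rng = rng or random.Random()
    out = list(tokens)
    for _ in range(budget):
        pos = rng.randint(0, len(out))  # both ends inclusive; len(out) appends
        out.insert(pos, MASK)
    return out

def replace_suspicious(tokens, budget, score):
    """Inference-time step: replace the `budget` highest-scoring tokens with [MASK]."""
    ranked = sorted(range(len(tokens)), key=lambda i: score(tokens[i]), reverse=True)
    chosen = set(ranked[:budget])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]

def rarity_score(vocab_counts):
    """Hypothetical suspicion score: tokens seen less often score higher."""
    return lambda tok: 1.0 / (1 + vocab_counts.get(tok, 0))

# Usage: a character-level perturbation ("gr8at") is unseen, so it is masked first.
counts = {"the": 100, "movie": 40, "was": 80, "great": 30}
tokens = ["the", "movie", "was", "gr8at"]
print(replace_suspicious(tokens, 1, rarity_score(counts)))
# -> ['the', 'movie', 'was', '[MASK]']
```

Both functions only rewrite the input sequence, which is consistent with the claim that the encoder architecture and loss stay unchanged.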

Paper Structure

This paper contains 21 sections, 3 theorems, 20 equations, 4 figures, 3 tables.

Key Result

Lemma 3.1

Let $\mathbf{a}, \mathbf{b}\in\mathbb R^d$ be two vectors. Let $\tilde{\mathbf{a}}$ be uniformly distributed between the origin $\mathbf{o}$ and $\mathbf{a}$, and similarly let $\tilde{\mathbf{b}}$ be uniformly distributed between $\mathbf{o}$ and $\mathbf{b}$, independent of $\tilde{\mathbf{a}}$. Then [...], where $\|\mathbf{x}\|$ is the $\ell_2$ norm of vector $\mathbf{x}$ and $\mathbb E$ is the expectation.
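
The displayed statement of the lemma is not reproduced on this page, so the sketch below illustrates only its sampling setup, not its conclusion. It assumes "uniformly distributed between $\mathbf{o}$ and $\mathbf{a}$" means $\tilde{\mathbf{a}} = u\,\mathbf{a}$ with $u \sim \mathrm{Uniform}[0,1]$ (and likewise $\tilde{\mathbf{b}} = v\,\mathbf{b}$ with independent $v$). Under that assumption, $\mathbb E[u^2] = 1/3$ and $\mathbb E[u]\,\mathbb E[v] = 1/4$ give $\mathbb E\|\tilde{\mathbf{a}} - \tilde{\mathbf{b}}\|^2 = \tfrac13\|\mathbf{a}\|^2 - \tfrac12\,\mathbf{a}\cdot\mathbf{b} + \tfrac13\|\mathbf{b}\|^2$. This quantity is chosen here purely as an example of an expectation over such vectors; the helper names are hypothetical.

```python
import random

def sample_on_segment(v, rng):
    """Draw a point uniformly on the segment from the origin to v (assumed reading)."""
    t = rng.random()
    return [t * x for x in v]

def mean_sq_distance(a, b, n=200_000, seed=0):
    """Monte Carlo estimate of E||a~ - b~||^2 with independent a~ and b~."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        ta = sample_on_segment(a, rng)
        tb = sample_on_segment(b, rng)
        total += sum((x - y) ** 2 for x, y in zip(ta, tb))
    return total / n

def closed_form(a, b):
    """(1/3)||a||^2 - (1/2) a.b + (1/3)||b||^2, from E[u^2] = 1/3 and E[u]E[v] = 1/4."""
    na = sum(x * x for x in a)
    nb = sum(y * y for y in b)
    dot = sum(x * y for x, y in zip(a, b))
    return na / 3 - dot / 2 + nb / 3

a, b = [1.0, 0.0], [0.0, 2.0]
print(mean_sq_distance(a, b), closed_form(a, b))
```

The two printed values should agree up to Monte Carlo error, which shrinks as `n` grows.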

Figures (4)

  • Figure 1: Comparison of existing adversarial defense methods.
  • Figure 2: The workflow of our proposed method, which preserves the encoder architecture and loss function of the vanilla model. Its distinctiveness lies in integrating [MASK] tokens into input sequences during both the training and inference stages.
  • Figure 3: The token geometry where $\mathbf{a}$, $\mathbf{r}$, $\mathbf{s}$, and $\mathbf{m}$ represent the victim token, replaced token, compressed unchanged tokens, and [MASK] token, respectively. $\tilde{\mathbf{x}}_a$, $\tilde{\mathbf{x}}_m$, $\tilde{\mathbf{x}}_r$, and $\tilde{\mathbf{x}}_{mr}$ denote the reconstructed [CLS] token using combinations of $\mathbf{a}$, $\mathbf{m}$, $\mathbf{r}$, and $\mathbf{s}$.
  • Figure 4: $f_0(x)$ and $f_e(x)$ function values.

Theorems & Definitions (7)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Theorem 3.3: Success condition for DDM
  • proof
  • Remark 3.4