MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification

Harrison Gietz, Jugal Kalita

TL;DR

MaskPure is a diffusion-inspired, stochastic text purification method that makes NLP classifiers more robust without adversarial training: it randomly masks tokens, refills them with a masked language model, and aggregates the classifier's predictions over the refilled copies by voting. The approach formalizes this pipeline as a smoothed classifier and derives a probabilistic robustness certificate for it. Empirically, MaskPure matches or exceeds the robustness of recent diffusion-based and random-perturbation defenses on AG News and IMDB, against both character-level and word-level attacks, while maintaining high clean accuracy. The certificate is estimated in practice by Monte Carlo sampling, so the method offers both theoretical and empirical evidence of robustness.

Abstract

Improving the robustness of language models, including defending them successfully against adversarial attacks, remains an open problem. In computer vision, the stochastic noising and de-noising process provided by diffusion models has proven useful for purifying input images, thereby improving model robustness against adversarial attacks. Some initial work has similarly explored random noising and de-noising to mitigate adversarial attacks in NLP, but the quality and efficiency of these methods must improve for them to remain competitive. We extend diffusion-inspired methods of input text purification, which randomly mask and refill portions of the input text before classification. Our method, MaskPure, matches or exceeds the robustness of other contemporary defenses, while requiring no adversarial classifier training and assuming no knowledge of the attack type. In addition, we show that MaskPure is certifiably robust. To our knowledge, MaskPure is the first stochastic-purification method with demonstrated success against both character-level and word-level attacks, which suggests that stochastic denoising defenses generalize across attack types. In summary, the MaskPure algorithm bridges the literature on the strongest current certifiable and empirical adversarial defenses, showing that theoretical and practical robustness can be obtained together. Code is available on GitHub at https://github.com/hubarruby/MaskPure.
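
The mask, refill, and vote loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mask_tokens`, `purify_and_vote`, `toy_fill`, and `toy_classify`, the whitespace tokenization, and the default values of `n_copies` and `mask_rate` are all hypothetical. In a real setup, `fill_masks` would wrap a masked language model and `classify` would be the classifier being defended; the repository linked above is the authoritative reference.

```python
import random
from collections import Counter
from typing import Callable, List

MASK = "[MASK]"  # placeholder mask symbol (assumption)


def mask_tokens(tokens: List[str], mask_rate: float, rng: random.Random) -> List[str]:
    """Replace a random subset of tokens with the mask symbol."""
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def purify_and_vote(
    text: str,
    fill_masks: Callable[[List[str]], List[str]],
    classify: Callable[[str], str],
    n_copies: int = 5,       # hypothetical default
    mask_rate: float = 0.3,  # hypothetical default
    seed: int = 0,
) -> str:
    """Mask and refill several copies of the input, then majority-vote the labels."""
    rng = random.Random(seed)
    tokens = text.split()  # whitespace tokenization, for illustration only
    votes: Counter = Counter()
    for _ in range(n_copies):
        masked = mask_tokens(tokens, mask_rate, rng)
        filled = fill_masks(masked)
        votes[classify(" ".join(filled))] += 1
    return votes.most_common(1)[0][0]


def toy_fill(tokens: List[str]) -> List[str]:
    """Toy stand-in for a masked language model: fill every mask with 'the'."""
    return ["the" if tok == MASK else tok for tok in tokens]


def toy_classify(text: str) -> str:
    """Toy classifier that is fooled whenever the word 'begun' is present."""
    return "World" if "begun" in text.split() else "Sports"


if __name__ == "__main__":
    sample = "the season has begun for the home team"
    print(purify_and_vote(sample, toy_fill, toy_classify, n_copies=7))
```

Because each copy masks a different random subset of tokens, an adversarial token survives in only some of the copies, and the vote can then outweigh the copies it still corrupts. How many copies to draw and how much to mask are tuning choices that this sketch does not try to settle.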

Paper Structure

This paper contains 18 sections, 1 theorem, 11 equations, 1 figure, and 3 tables.

Key Result

Theorem 1

For an original text $x$ and an adversarial text $x^{\prime}$, if $\left\|x-x^{\prime}\right\|_0 \leq d$, then: $\forall c \in \mathcal{Y}$. Here,

Figures (1)

  • Figure 1: The pipeline for the MaskPure purification process, demonstrated on an example from the AG News dataset. In this illustrative example, the perturbed sample contains an adversarial word, "begun", that leads to misclassification. The masking, filling, and voting process allows the classifier to recover the correct label, Sports.

Theorems & Definitions (1)

  • Theorem 1