DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising

Zhenhao Li, Huichi Zhou, Marek Rei, Lucia Specia

TL;DR

DiffuseDef is a flexible adversarial defense method for language classification tasks. It inserts a diffusion layer as a denoiser between the encoder and the classifier, and achieves state-of-the-art performance against common black-box and white-box adversarial attacks.

Abstract

Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. The diffusion layer is trained on top of the existing classifier, ensuring seamless integration with any model in a plug-and-play manner. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over existing adversarial defense methods and achieves state-of-the-art performance against common black-box and white-box adversarial attacks.
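
The inference procedure described above (add sampled noise to the hidden state, denoise iteratively, then ensemble) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the DDPM-style reverse update, averaging as the ensembling rule, and the names `robust_predict`, `predict_noise`, and `classify` are all hypothetical choices made for the example. The trained diffusion layer and the classifier head are passed in as plain callables.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; lists are indexed by t - 1."""
    betas = [beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def add_noise(h, alpha_bar, rng):
    """Forward noising: sqrt(alpha_bar) * h + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for x in h]


def denoise_step(h_t, t, betas, alpha_bars, predict_noise):
    """One deterministic DDPM-style reverse step (an assumed update rule)."""
    beta, a_bar = betas[t - 1], alpha_bars[t - 1]
    eps_hat = predict_noise(h_t, t)  # stand-in for the trained diffusion layer
    return [(x - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
            for x, e in zip(h_t, eps_hat)]


def robust_predict(h, predict_noise, classify, t_prime=5, ensemble_size=4,
                   num_steps=30, seed=0):
    """Noise the hidden state, denoise it for t_prime steps, average, classify."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    denoised = []
    for _ in range(ensemble_size):
        h_t = add_noise(h, alpha_bars[0], rng)      # one step of fresh noise
        for t in range(t_prime, 0, -1):             # iterative denoising
            h_t = denoise_step(h_t, t, betas, alpha_bars, predict_noise)
        denoised.append(h_t)
    # Ensembling is taken here to mean averaging the denoised hidden states.
    avg = [sum(col) / len(denoised) for col in zip(*denoised)]
    return classify(avg), avg


# Usage with toy stand-ins for the diffusion layer and the classifier head.
toy_denoiser = lambda h_t, t: [0.0] * len(h_t)
toy_classifier = lambda v: int(sum(v) > 0)
label, hidden = robust_predict([5.0] * 8, toy_denoiser, toy_classifier)
```

Because each ensemble member starts from a different noise sample, the averaged representation depends less on any single perturbation of the hidden state, which is the intuition the abstract gives for combining denoising with ensembling.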

Paper Structure (29 sections, 4 equations, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Training and inference of the DiffuseDef model. The adversarial training stage trains the pretrained encoder and classifier on perturbed input for adversarial robustness. The diffusion training stage trains the diffusion layer to predict the injected noise at a given timestep $t$. At inference time, the text hidden state is first noised by 1 step and then denoised by $t^\prime$ steps to create the denoised hidden states, which are ensembled to make the final prediction.
  • Figure 2: Robustness of DiffuseDef with textual adversarial augmentation method.
  • Figure 3: AUA and #Query (TextFooler) w.r.t. the inference denoising step for DiffuseDef with and without ensembling.
  • Figure 4: Distribution of max token importance score in the AGNews test set.
  • Figure 5: Defense rate (against TextFooler) w.r.t. token length for different models on the IMDB dataset.
  • ...and 2 more figures
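
The diffusion training stage in the Figure 1 caption (predict the noise injected at a timestep $t$) corresponds to a noise-prediction objective. The sketch below is a hedged illustration only: the linear schedule, the mean-squared-error loss, and the names `alpha_bar`, `noise_prediction_loss`, and `predict_noise` are assumptions introduced for the example, and `predict_noise` is a placeholder for the diffusion layer rather than the authors' model.

```python
import math
import random


def alpha_bar(t, num_steps=30, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for steps 1..t under an assumed linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def noise_prediction_loss(h0, predict_noise, rng, num_steps=30):
    """Sample a timestep, inject noise into h0, and score the predicted noise with MSE."""
    t = rng.randint(1, num_steps)                  # timestep, inclusive on both ends
    a_bar = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in h0]        # the injected noise
    h_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(h0, eps)]
    eps_hat = predict_noise(h_t, t)                # placeholder for the diffusion layer
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)


# Usage: a predictor that always outputs zeros leaves the loss well above zero.
rng = random.Random(0)
clean_hidden = [0.3, -1.2, 0.8, 2.0]
loss = noise_prediction_loss(clean_hidden, lambda h_t, t: [0.0] * len(h_t), rng)
```

In an actual training loop the loss would be minimized with respect to the diffusion layer's parameters; per the abstract, the diffusion layer is trained on top of the existing classifier.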