Enhancing Adversarial Robustness via Uncertainty-Aware Distributional Adversarial Training

Junhao Dong, Xinghua Qu, Z. Jane Wang, Yew-Soon Ong

TL;DR

A novel uncertainty-aware distributional adversarial training method that models adversaries using both the statistical information of adversarial examples and the corresponding uncertainty estimates, with the goal of increasing the diversity of adversaries.

Abstract

Despite the remarkable achievements of deep learning across various domains, its inherent vulnerability to adversarial examples remains a critical concern for practical deployment. Adversarial training has emerged as one of the most effective defensive techniques for improving model robustness against such malicious inputs. However, existing adversarial training schemes often generalize poorly to diverse adversaries because they rely on a point-by-point augmentation strategy that maps each clean example to its adversarial counterpart during training. In addition, adversarial examples can significantly disrupt the statistical information with respect to the target model, which introduces substantial uncertainty and makes the distribution of adversarial examples harder to model. To address these issues, we propose a novel uncertainty-aware distributional adversarial training method, which models adversaries by leveraging both the statistical information of adversarial examples and the corresponding uncertainty estimates, with the goal of increasing the diversity of adversaries. Because aligning adversaries to misclassified clean examples can have a negative impact, we also refine the alignment reference based on its statistical proximity to clean examples during adversarial training, thereby reframing adversarial training as a distribution-to-distribution matching framework between the clean and adversarial domains. Furthermore, we design an introspective gradient alignment approach that matches input gradients between these domains without introducing external models. Extensive experiments across four benchmark datasets and various network architectures demonstrate that our approach achieves state-of-the-art adversarial robustness while maintaining natural performance.
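The abstract gives no formulas, so the following is only a rough sketch of what "leveraging the statistical information of adversarial examples and its uncertainty estimation" could look like in practice. It assumes that the statistics are the per-example mean and standard deviation of a feature vector, that uncertainty is estimated as the spread of those statistics across a batch, and that new adversarial features are obtained by resampling the statistics with Gaussian noise. All names here (`feature_stats`, `resample_statistics`, `scale`) are hypothetical, and this is a generic feature-statistics resampling scheme, not the authors' exact formulation.

```python
import random
import statistics


def feature_stats(feat, eps=1e-6):
    """Mean and standard deviation of one feature vector (eps avoids division by zero)."""
    mu = statistics.fmean(feat)
    sigma = statistics.pstdev(feat) + eps
    return mu, sigma


def resample_statistics(batch, scale=1.0, rng=None):
    """Re-style each feature vector with statistics drawn around its own mean/std.

    Assumption: the batch-level spread of the statistics serves as the
    uncertainty estimate, and new statistics are sampled from a Gaussian.
    """
    if rng is None:
        rng = random.Random(0)
    stats = [feature_stats(f) for f in batch]
    # Uncertainty of each statistic = its standard deviation across the batch.
    unc_mu = statistics.pstdev([m for m, _ in stats])
    unc_sigma = statistics.pstdev([s for _, s in stats])

    augmented = []
    for feat, (mu, sigma) in zip(batch, stats):
        new_mu = mu + scale * unc_mu * rng.gauss(0.0, 1.0)
        new_sigma = abs(sigma + scale * unc_sigma * rng.gauss(0.0, 1.0))
        # Normalize with the original statistics, then apply the sampled ones.
        augmented.append([new_sigma * (v - mu) / sigma + new_mu for v in feat])
    return augmented


if __name__ == "__main__":
    adv_features = [[0.1, 0.5, 0.9, 1.3], [2.0, 1.0, 0.0, -1.0], [0.3, 0.3, 0.6, 0.9]]
    for row in resample_statistics(adv_features, scale=1.0):
        print([round(v, 3) for v in row])
```

Under these assumptions, each adversarial feature yields a family of variants instead of a single point, which is the intuition behind moving from point-by-point augmentation to a distributional view. Setting `scale=0.0` recovers the original features.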

Paper Structure

This paper contains 21 sections, 12 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) The decision boundary of a standard adversarially trained model (Madry et al., 2018) can overfit to adversaries generated by the point-by-point strategy. A larger perturbation radius of adversarial examples leads to an increase in the average (b) feature variance (%) and (c) gradient norm (%).
  • Figure 2: t-SNE visualization of a legitimate feature and its 500 adversarial counterparts generated via the random start strategy on CIFAR-10 (Krizhevsky, 2009). Each panel corresponds to a specific clean input.
  • Figure 3: (a) Average robust accuracy, (b) feature variance, and (c) gradient norm under different attack strengths for an adversarially trained model (PGD-AT; Madry et al., 2018). Test samples are ranked by their cross-entropy loss in ascending order and then divided into two equal halves. A negative perturbation radius denotes the benign refinement of clean samples, obtained by inverting the gradient sign during adversary generation. The green line denotes the clean samples.
  • Figure 4: Accuracy-robustness trade-off of our UAD-AT method by tuning the hyper-parameter $\beta$. We report both clean accuracy (%) and (Auto-Attack) robust accuracy (%).
  • Figure 5: t-SNE visualization of the legitimate feature, adversarial features, and augmented features on CIFAR-10 (Krizhevsky, 2009). Each panel corresponds to a specific clean input.
  • ...and 5 more figures
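
The protocol described in the Figure 3 caption (ranking samples by cross-entropy loss, splitting them into two halves, and using a negative perturbation radius for benign refinement) can be illustrated with a minimal sketch. It assumes a toy binary linear classifier with labels in {-1, +1} and a single signed-gradient step; the caption does not specify the model or attack at this level of detail, and the names `ce_loss`, `signed_step`, and `split_by_loss` are hypothetical.

```python
import math


def ce_loss(w, x, y):
    """Cross-entropy (logistic) loss of a linear classifier, with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def input_grad(w, x, y):
    """Gradient of ce_loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def signed_step(w, x, y, radius):
    """One signed-gradient step.

    radius > 0 raises the loss (adversarial perturbation);
    radius < 0 inverts the direction and lowers the loss (benign refinement).
    """
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + radius * sign(gi) for xi, gi in zip(x, input_grad(w, x, y))]


def split_by_loss(w, samples):
    """Rank (x, y) pairs by loss in ascending order and split them into two halves."""
    ranked = sorted(samples, key=lambda s: ce_loss(w, s[0], s[1]))
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


if __name__ == "__main__":
    w = [1.0, -2.0]
    x, y = [0.5, 0.2], 1
    for radius in (-0.1, 0.0, 0.1):
        print(radius, round(ce_loss(w, signed_step(w, x, y, radius), y), 4))
```

In this toy setting the loss decreases for the negative radius and increases for the positive one, which matches the caption's description of benign refinement versus attack.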