Improving Adversarial Robustness of Ensembles with Diversity Training

Sanjay Kariyappa, Moinuddin K. Qureshi

TL;DR

The paper defends against transfer-based adversarial attacks by training ensembles whose members have uncorrelated loss gradients. It introduces a Gradient Alignment Loss (GAL) that quantifies the overlap of the members' adversarial subspaces and is minimized during training; the resulting method is called Diversity Training (DivTrain). Experiments on MNIST and CIFAR-10 show that DivTrain improves robustness against black-box attacks and can be combined with Ensemble Adversarial Training for a stronger defense. The approach targets deployments where the main threat is an attacker who crafts adversarial examples on a surrogate model and transfers them to the target.

Abstract

Deep Neural Networks are vulnerable to adversarial attacks even in settings where the attacker has no direct access to the model being attacked. Such attacks usually rely on the principle of transferability, whereby an attack crafted on a surrogate model tends to transfer to the target model. We show that an ensemble of models with misaligned loss gradients can provide an effective defense against transfer-based attacks. Our key insight is that an adversarial example is less likely to fool multiple models in the ensemble if their loss functions do not increase in a correlated fashion. To this end, we propose Diversity Training, a novel method to train an ensemble of models with uncorrelated loss functions. We show that our method significantly improves the adversarial robustness of ensembles and can also be combined with existing methods to create a stronger defense.
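
To make the notion of "misaligned loss gradients" concrete, the sketch below measures how aligned the input gradients of an ensemble are for a single example. It is only an illustration under stated assumptions: the exact form of the paper's Gradient Alignment Loss is not given in this section, so the log-sum-exp penalty here is an assumed smooth stand-in for the maximum pairwise cosine similarity, and the function names (`pairwise_cosine`, `coherence`, `alignment_penalty`) are hypothetical.

```python
import numpy as np

def pairwise_cosine(grads):
    """Cosine similarity for every pair of models.

    grads: (n_models, d) array; row i is the flattened gradient of
    model i's loss with respect to the same input example.
    """
    g = np.asarray(grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.maximum(norms, 1e-12)      # guard against zero gradients
    sim = unit @ unit.T                      # (n_models, n_models) cosine matrix
    upper = np.triu_indices(len(g), k=1)     # each unordered pair once
    return sim[upper]

def coherence(grads):
    """Largest pairwise cosine similarity: 1 means two models share a direction."""
    return float(np.max(pairwise_cosine(grads)))

def alignment_penalty(grads):
    """Assumed smooth surrogate for coherence: log-sum-exp over pairwise cosines."""
    cs = pairwise_cosine(grads)
    m = cs.max()                             # shift for numerical stability
    return float(m + np.log(np.sum(np.exp(cs - m))))

# Three models whose gradients point the same way vs. mutually orthogonal ones.
aligned = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
diverse = np.eye(3)

print(coherence(aligned), coherence(diverse))
print(alignment_penalty(aligned), alignment_penalty(diverse))
```

In a training loop, a penalty of this kind would be added to the usual classification loss with a weighting coefficient, so that the members are pushed toward lower gradient alignment; the weighting and the details of differentiating through the gradients are left out of this sketch.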

Paper Structure

This paper contains 23 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Venn diagram illustrations of the adversarial subspace of (a) a single model, (b) an ensemble of 3 models, and (c) a Diverse Ensemble. Our goal is to reduce the overlap in the adversarial subspaces of the models in the ensemble, as shown in (c).
  • Figure 2: Illustration showing the relationship between the gradient alignment and overlap of adversarial subspaces of two models. Misaligned gradients indicate a smaller overlap in the adversarial subspace.
  • Figure 4: Histogram of Coherence values plotted for Conv-3, Conv-4 and Resnet-20 comparing $T_{Base}$, $T_{Div}$, $T_{Ens}$ and $T_{Ens+Div}$. Models trained with Diversity Training (GAL regularization) have misaligned gradient vectors with lower coherence values.
  • Figure 5: Gradient-aligned adversarial subspace analysis performed on Conv-4 with CIFAR-10. Plots show the probability of finding $k$ orthogonal adversarial directions for 3 different perturbation sizes $\epsilon=0.03/0.06/0.09$. $T_{Div}$ has consistently lower probabilities of finding an adversarial direction than $T_{Base}$, showing that DivTrain lowers the dimensionality of the adversarial subspace (Adv-SS) of an ensemble.
  • Figure : Structure of the models used in our evaluations