The Space of Transferable Adversarial Examples

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel

TL;DR

The paper investigates why adversarial examples often transfer across models by quantifying the dimensionality of adversarial subspaces and analyzing decision boundary proximity. It introduces the Gradient Aligned Adversarial Subspace (GAAS) method to identify many orthogonal adversarial directions, revealing a high-dimensional, contiguous subspace whose perturbations largely transfer between models, even across architectures. It then links transferability to boundary geometry via distance-based metrics and demonstrates that adversarial training displaces decision boundaries only slightly, leaving room for black-box attacks. Finally, the work provides sufficient conditions for transferability via model-agnostic perturbations and presents counterexamples (including XOR artifacts) showing that transfer is not universal, suggesting potential defenses tailored to data and representation properties.

Abstract

Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. They often transfer: the same adversarial example fools more than one model. In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large (~25) dimensionality. Adversarial subspaces with higher dimensionality are more likely to intersect. We find that for two different models, a significant fraction of their subspaces is shared, thus enabling transferability. In the first quantitative analysis of the similarity of different models' decision boundaries, we show that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude by formally studying the limits of transferability. We derive (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of scenarios in which transfer does not occur. These findings indicate that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks.

Paper Structure

This paper contains 32 sections, 3 theorems, 14 equations, 7 figures, 3 tables.

Key Result

Lemma 1

Given ${\bm {g}} \in \mathbb{R}^d$ and $\alpha \in [0, 1]$, the maximum number $k$ of orthogonal vectors ${\bm {r}}_1, {\bm {r}}_2, \dots, {\bm {r}}_k \in \mathbb{R}^d$ satisfying $\|{\bm {r}}_i\|_2 \leq 1$ and ${\bm {g}}^\top {\bm {r}}_i \geq \alpha \cdot \|{\bm {g}}\|_2$ is $k = \min\left\{\left\lfloor \frac{1}{\alpha^2} \right\rfloor, d\right\}$.
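
To make the lemma concrete, the sketch below builds one set of orthonormal directions that all meet the alignment constraint. It takes the bound to be $k = \min\{\lfloor 1/\alpha^2 \rfloor, d\}$; the upper bound follows from Bessel's inequality, since $k\alpha^2\|{\bm g}\|_2^2 \leq \sum_i ({\bm g}^\top {\bm r}_i)^2 \leq \|{\bm g}\|_2^2$. The construction is an illustrative assumption and not necessarily the one used in the paper: it applies a Householder reflection that maps the normalized sum of the first $k$ basis vectors onto ${\bm g}/\|{\bm g}\|_2$. The function names `max_aligned_directions` and `aligned_orthonormal_directions` are hypothetical.

```python
import numpy as np


def max_aligned_directions(alpha: float, d: int) -> int:
    """Bound min(floor(1 / alpha^2), d) on the number of aligned directions."""
    if alpha <= 0.0:
        return d  # only the dimension limits k when alpha is 0
    # The small tolerance guards against 1 / alpha**2 landing just below an
    # integer because of floating-point rounding (e.g. alpha = 0.2).
    return min(int(np.floor(1.0 / alpha**2 + 1e-9)), d)


def aligned_orthonormal_directions(g, alpha: float) -> np.ndarray:
    """Return a (k, d) array of orthonormal rows r_i with g.r_i = ||g|| / sqrt(k)."""
    g = np.asarray(g, dtype=float)
    d = g.size
    norm = np.linalg.norm(g)
    if norm == 0.0:
        raise ValueError("g must be non-zero")
    k = max_aligned_directions(alpha, d)

    g_hat = g / norm
    # v is the unit vector with equal weight on the first k coordinates, so
    # each basis vector e_1..e_k has inner product 1 / sqrt(k) with v.
    v = np.zeros(d)
    v[:k] = 1.0 / np.sqrt(k)

    # Householder reflection H with H v = g_hat. H is orthogonal, so the
    # images H e_i stay orthonormal and keep inner product 1 / sqrt(k) with g_hat.
    u = v - g_hat
    uu = u @ u
    if uu < 1e-24:
        H = np.eye(d)  # v already equals g_hat
    else:
        H = np.eye(d) - 2.0 * np.outer(u, u) / uu

    # H is symmetric, so its first k rows are the vectors H e_1, ..., H e_k.
    return H[:k]


# Usage on a random stand-in for a gradient (not data from the paper).
rng = np.random.default_rng(0)
g = rng.normal(size=50)
R = aligned_orthonormal_directions(g, alpha=0.5)
```

With `alpha=0.5` the bound gives four directions, each with inner product $\|{\bm g}\|_2 / 2$ with the gradient, so the constraint holds with equality.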

Figures (7)

  • Figure 1: Illustration of the Gradient Aligned Adversarial Subspace (GAAS). The gradient aligned attack (red arrow) crosses the decision boundary. The black arrows are orthogonal vectors aligned with the gradient that span a subspace of potential adversarial inputs (orange).
  • Figure 2: Probability density function of the number of successful orthogonal adversarial perturbations found by the GAAS method on the source DNN model, and of the number of perturbations that transfer to the target DNN model.
  • Figure 3: The three directions (Legitimate, Adversarial and Random) used throughout the section on decision boundaries to measure the distance between the decision boundaries of two models. The gray double-ended arrows illustrate the inter-boundary distance between the two models in each direction.
  • Figure 4: Minimum distances and inter-boundary distances in three directions for MNIST models. Each plot shows results for one source model (Logistic Regression, Support Vector Machine, Deep Neural Network), and all three classes of target models (one hatched bar per model class). Within each plot, bars are grouped by direction (legitimate, adversarial and random). The filled black bar shows the minimum distance to the decision boundary for the source model. The adversarial search uses the FGM with $\varepsilon=5$. For example, the left group in the left plot shows that the minimal distance on the Logistic Regression (LR) model in the legitimate direction is about $4$, and that the distance between the LR's boundary and the boundaries of other models in that direction is lower than $1$.
  • Figure 5: MNIST digits perturbed by adding the difference in class means.
  • ...and 2 more figures
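
The distances described in the captions of Figures 3 and 4 can be approximated with a simple line search. The sketch below is a reading of those captions rather than the authors' code: it assumes the minimum distance in a direction is the smallest step along that unit direction at which a model's prediction changes, and that the inter-boundary distance is the absolute difference between two models' minimum distances. The names `min_dist` and `inter_boundary_dist`, the grid parameters and the two toy threshold classifiers are all hypothetical.

```python
import numpy as np


def min_dist(predict, x, direction, max_eps=10.0, step=0.01):
    """Smallest grid value eps with predict(x + eps * d) != predict(x).

    `direction` is normalized to unit L2 norm. Returns inf if the prediction
    does not change within max_eps.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    label = predict(x)
    n_steps = int(round(max_eps / step))
    for i in range(1, n_steps + 1):
        eps = i * step
        if predict(x + eps * d) != label:
            return eps
    return np.inf


def inter_boundary_dist(predict_a, predict_b, x, direction, **search_kwargs):
    """Absolute gap between the two models' boundaries along one direction.

    The result is nan when neither boundary is found within max_eps.
    """
    dist_a = min_dist(predict_a, x, direction, **search_kwargs)
    dist_b = min_dist(predict_b, x, direction, **search_kwargs)
    return abs(dist_a - dist_b)


# Toy stand-ins for two trained models: thresholds on the first coordinate.
def model_a(x):
    return int(x[0] > 1.0)


def model_b(x):
    return int(x[0] > 1.5)


x = np.zeros(2)
direction = np.array([1.0, 0.0])
dist_a = min_dist(model_a, x, direction)
gap = inter_boundary_dist(model_a, model_b, x, direction)
```

In this toy setup `dist_a` is close to 1 and `gap` is close to 0.5, up to the resolution of the search grid. For the adversarial direction one would pass a gradient-based direction computed on the source model, and for the random direction a sampled vector.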

Theorems & Definitions (4)

  • Lemma 1
  • Theorem 2
  • Lemma 3
  • proof