Adversarial Manipulation of Deep Representations

Sara Sabour, Yanshuai Cao, Fartash Faghri, David J. Fleet

TL;DR

The paper introduces feature adversaries, a class of adversarial images that, with imperceptible perturbations, force a DNN’s intermediate representations to resemble those of a chosen guide image while keeping the image perceptually similar to the source. It formalizes a constrained optimization to minimize representation distance at a target layer and validates the approach across multiple networks and datasets, showing that the adversarial encodings are often natural-looking and close to the guide in high-dimensional feature space. Through Euclidean and manifold-based analyses (PPCA tangent space and angular consistency) and comparisons with label-based adversaries, the authors demonstrate that the phenomenon reflects intrinsic properties of DNN representations rather than mere misclassification or linearity effects, and that architecture plays a substantial role, even in randomly weighted networks. These findings raise fundamental questions about the geometry of natural image manifolds in deep representations and have implications for model robustness and defense against sophisticated adversaries.

Abstract

We show that the representation of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels, while we concentrate on the internal layers of DNN representations. In this way our new class of adversarial images differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, one from a different class, bearing little if any apparent similarity to the input; they appear generic and consistent with the space of natural images. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves.
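The constrained optimization mentioned in the TL;DR can be illustrated with a small, self-contained sketch. It assumes the objective is the squared Euclidean distance between the encoding of the perturbed image and that of the guide, and that the constraint is an L-infinity bound $\delta$ on the per-pixel change from the source (the $\delta$ values in Figure 1 suggest this form). The "network" here is a hypothetical single ReLU layer with random weights, and the solver is plain projected gradient descent, which is not necessarily the authors' optimizer. All function and variable names are illustrative.

```python
import random


def features(weights, image):
    """Toy stand-in for a DNN layer phi(.): ReLU(W @ image)."""
    return [max(0.0, sum(w * p for w, p in zip(row, image))) for row in weights]


def feature_distance(weights, image, target):
    """Squared Euclidean distance between phi(image) and a target encoding."""
    return sum((f - t) ** 2 for f, t in zip(features(weights, image), target))


def feature_adversary(weights, source, guide, delta, steps=300, lr=0.1):
    """Projected gradient descent (an assumed solver) on ||phi(x) - phi(guide)||^2
    subject to |x_i - source_i| <= delta and 0 <= x_i <= 255."""
    target = features(weights, guide)
    x = list(source)
    best, best_dist = list(x), feature_distance(weights, x, target)
    for _ in range(steps):
        pre = [sum(w * p for w, p in zip(row, x)) for row in weights]
        # Gradient of 0.5 * ||ReLU(Wx) - target||^2; ReLU passes it only where pre > 0.
        resid = [(a - t) if a > 0 else 0.0 for a, t in zip(pre, target)]
        grad = [sum(weights[j][i] * resid[j] for j in range(len(weights)))
                for i in range(len(x))]
        x = [p - lr * g for p, g in zip(x, grad)]
        # Project back into the L-infinity ball around the source and the pixel range.
        x = [min(max(p, s - delta, 0.0), s + delta, 255.0) for p, s in zip(x, source)]
        dist = feature_distance(weights, x, target)
        if dist < best_dist:  # keep the best feasible iterate seen so far
            best, best_dist = list(x), dist
    return best


# Toy data: 16 "pixels" and 8 features; purely illustrative, not from the paper.
rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(16)] for _ in range(8)]
source = [rng.uniform(0, 255) for _ in range(16)]
guide = [rng.uniform(0, 255) for _ in range(16)]

adv = feature_adversary(W, source, guide, delta=10.0)
target = features(W, guide)
d_src = feature_distance(W, source, target)
d_adv = feature_distance(W, adv, target)
max_change = max(abs(a - s) for a, s in zip(adv, source))
print(f"distance to guide encoding: source={d_src:.3f}, adversary={d_adv:.3f}")
print(f"largest pixel change: {max_change:.3f}")
```

Because the sketch keeps the best iterate and starts from the source, the returned image is never farther from the guide encoding than the source is, and the projection step keeps every pixel within $\delta$ of the source. A real experiment would replace `features` with an actual network layer and compute the gradient by automatic differentiation.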

Paper Structure

This paper contains 24 sections, 2 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Each row shows examples of adversarial images, optimized using different layers of Caffenet (FC$7$, P$5$, and C$3$), and different values of $\delta=(5, 10, 15)$. Beside each adversarial image is the difference between it and its corresponding source image.
  • Figure 2:
  • Figure 3: Histogram of the Euclidean distances between FC7 adversarial encodings ($\boldsymbol{\alpha}$) and the corresponding source ($\boldsymbol{s}$) and guide ($\boldsymbol{g}$) encodings, for optimizations targeting FC7. Here, $d(x,y)$ is the distance between $x$ and $y$, $\overline{d}(\boldsymbol{s})$ denotes the average pairwise distance between points from images of the same class as the source, and $\overline{d_1}(\boldsymbol{g})$ is the average distance to the nearest-neighbor encoding among images of the same class as the guide. Histograms aggregate over all source-guide pairs.
  • Figure 4: Manifold inlier analysis. The first two columns (FC7 and P5 panels, train and validation) show results of the manifold tangent space analysis: the distribution of the difference in log likelihood between a point and $\boldsymbol{g}$, $\Delta L(\cdot,\boldsymbol{g})=L(\cdot)-L(\boldsymbol{g})$. The last column (FC7 and P5 panels, train) shows the angular consistency analysis: the distribution of the angular consistency $\Omega(\cdot,\boldsymbol{g})$ between a point and $\boldsymbol{g}$. See the paper's definition of $\Omega$ for details.
  • Figure (a): Label-opt and feature-opt PPCA and rank measure comparison plots.
  • ...and 8 more figures