Adversarial Images for Variational Autoencoders

Pedro Tabacof, Julia Tavares, Eduardo Valle

TL;DR

Addresses how to craft adversarial inputs for autoencoders to make reconstructions resemble a chosen target image. Proposes a latent-space attack on both variational and deterministic autoencoders, optimizing input perturbations to align the adversarial latent with the target latent under a distortion budget. Empirically, autoencoders show substantial robustness compared with classifiers: small input changes yield little reconstruction change, and larger changes yield proportional improvements toward the target with a saturation point. The study demonstrates this behavior on MNIST and SVHN and discusses implications for robustness and design of generative models, suggesting future exploration of more complex architectures and theory.

Abstract

We investigate adversarial attacks for autoencoders. We propose a procedure that distorts the input image to mislead the autoencoder into reconstructing a completely different target image. We attack the internal latent representations, attempting to make the adversarial input produce an internal representation as similar as possible to the target's. We find that autoencoders are much more robust to the attack than classifiers: while some examples have tolerably small input distortion and reasonable similarity to the target image, there is a quasi-linear trade-off between those aims. We report results on the MNIST and SVHN datasets, and also test regular deterministic autoencoders, reaching similar conclusions in all cases. Finally, we show that the usual adversarial attack for classifiers, while being much easier, also presents a direct proportion between distortion on the input and misdirection on the output. That proportionality, however, is hidden by the normalization of the output, which maps a linear layer into non-linear probabilities.
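To make the latent-space attack concrete, here is a minimal sketch under strong simplifying assumptions. It is not the authors' implementation: the encoder is a toy linear map `W` (hypothetical), the latent mismatch is a squared Euclidean distance rather than whatever divergence the paper uses, and the weight `C` on the distortion term and the `[0, 1]` pixel range are illustrative choices. It only shows the general shape of the idea: search for a distortion `d` such that the latent code of `x + d` approaches the latent code of the target, while penalizing the size of `d`.

```python
import numpy as np

def latent_attack(W, x, x_target, C=0.1, lr=0.01, steps=2000):
    """Toy latent-space attack on a linear 'encoder' z = W @ x (an assumption).

    Minimizes ||W @ (x + d) - W @ x_target||^2 + C * ||d||^2 over d by
    projected gradient descent, keeping x + d inside the pixel range [0, 1].
    """
    z_target = W @ x_target
    d = np.zeros_like(x)
    for _ in range(steps):
        residual = W @ (x + d) - z_target           # latent mismatch
        grad = 2.0 * W.T @ residual + 2.0 * C * d   # gradient of the objective
        d = d - lr * grad
        d = np.clip(x + d, 0.0, 1.0) - x            # project back to valid pixels
    return d

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)) / 4.0    # hypothetical encoder weights
x = rng.uniform(size=16)              # "original" image, flattened
x_target = rng.uniform(size=16)       # "target" image, flattened

d = latent_attack(W, x, x_target)
before = np.linalg.norm(W @ x - W @ x_target)
after = np.linalg.norm(W @ (x + d) - W @ x_target)
print(f"latent distance before: {before:.4f}, after: {after:.4f}")
print(f"distortion norm: {np.linalg.norm(d):.4f}")
```

Sweeping `C` over a logarithmic grid and recording the distortion norm against the remaining latent distance would trace out a trade-off curve of the kind the abstract describes, although a linear toy encoder says nothing about how a real autoencoder behaves.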

Paper Structure

This paper contains 6 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Autoencoders are models able to map their input into a (deterministic or stochastic) latent representation, and then to map such representation into an output similar to the input; those two maps form the two halves of the model: the encoder and the decoder.
  • Figure 2: Adversarial attacks for autoencoders add (ideally small) distortions to the input, aiming at making the autoencoder reconstruct a different target. We attack the latent representation, attempting to match it to the target image's.
  • Figure 3: Top row: MNIST. Bottom row: SVHN. The figures on the left show the trade-off between the quality of adversarial attack and the adversarial distortion magnitude, with changing regularization parameter (implicit in the graphs, chosen from a logarithmic scale). The figures on the right correspond to the points shown in red in the graphs, illustrating adversarial images and reconstructions using fully-connected, and convolutional variational autoencoders (for MNIST and SVHN, respectively).
  • Figure 4: Plots for the whole set of experiments in MNIST and SVHN. Top: variational autoencoders (VAE). Bottom: deterministic autoencoders (AE). Each line in a graph corresponds to one experiment with adversarial images from a single pair of original/target images, varying the regularization parameter $C$ (as shown in Fig. 3). The “distortion” and “adversarial − target” axes show the trade-off between cost and success. The “hinge” where the lines saturate shows the point where the reconstruction is essentially equal to the target's: the distortion at the hinge measures the resistance to the attack.
  • Figure 5: Examples for the classification attacks. Top: MNIST. Bottom: SVHN. Left: probabilities. Middle: logit transform of probabilities. Right: images illustrating the intersection point of the curves. The adversarial class is ‘4’ for MNIST, and ‘0’ for SVHN. The red curve shows the probability/logit for the adversarial class, and the blue curve shows the same for the original class: the point where the curves cross is the transition point between failure and success of the attack.
  • ...and 1 more figure
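The Figure 5 caption contrasts probabilities with their logit transform, and the abstract notes that output normalization hides a proportional relationship. The sketch below illustrates that point on a made-up two-class example; the scores and the interpolation variable `t` are hypothetical and do not come from the paper. The class scores change linearly with `t`, the softmax probabilities follow an S-shaped curve, and applying the logit to the probability recovers the underlying straight line.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return np.log(p) - np.log1p(-p)

# Hypothetical attack strength, from no distortion (0) to full distortion (1).
t = np.linspace(0.0, 1.0, 11)

# Made-up linear-layer scores: column 0 is the original class, column 1 the
# adversarial class. Both vary linearly with t.
scores = np.stack([4.0 - 8.0 * t, -4.0 + 8.0 * t], axis=1)

probs = softmax(scores)          # non-linear in t
adv_logit = logit(probs[:, 1])   # linear in t again

for ti, p, l in zip(t, probs[:, 1], adv_logit):
    print(f"t={ti:.1f}  p(adversarial)={p:.4f}  logit={l:+.2f}")
```

With two classes the logit of the adversarial probability equals the difference of the two scores, so the recovery is exact. With more classes the relationship is only approximate, which is one reason to treat this as an illustration rather than a reproduction of the paper's analysis.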