Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks

Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio

TL;DR

Adversarial perturbations threaten classifier reliability, and existing defenses often require costly training. The authors propose a training-free adversarial purification framework that leverages pre-trained Multiple Latent Variable Generative Models (MLVGMs), whose multiple latent codes disentangle global, class-relevant information from local detail, including adversarial noise. The method encodes an input into latent codes $z^{\text{e}}_i$, samples $z^{\text{s}}_i$ from the priors, and decodes the interpolated latents $z_i = (1 - \alpha_i) z^{\text{e}}_i + \alpha_i z^{\text{s}}_i$ with $0 \le \alpha_i \le 1$ to produce purified images; the hyperparameters $\alpha_i$ can be found via Bayesian Optimization or set with fixed monotonic schedules. Experiments on CelebA Gender, CelebA identities, and Stanford Cars with StyleGAN2 and NVAE show the approach is competitive with or close to specialized defenses (TRADES, A-VAE, ND-VAE) despite using smaller models and no task-specific training, highlighting the potential of MLVGMs as foundation models for defense. The work suggests a scalable path toward robust vision systems and motivates releasing stronger MLVGMs trained on billions of samples for broader downstream use.
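
The interpolation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`linear_alphas`, `cosine_alphas`, `purify_latents`) are hypothetical, the prior is assumed to be a standard normal, the codes are assumed to be ordered from global to local, and the encoder and decoder of the pre-trained MLVGM are left out entirely. The two schedule shapes are only examples of monotonic schedules; the paper's exact definitions may differ.

```python
import numpy as np


def linear_alphas(n_codes: int) -> np.ndarray:
    """Hypothetical linear schedule: 0 for the first code, 1 for the last."""
    return np.linspace(0.0, 1.0, n_codes)


def cosine_alphas(n_codes: int) -> np.ndarray:
    """Hypothetical cosine-shaped schedule, also rising monotonically from 0 to 1."""
    t = np.linspace(0.0, 1.0, n_codes)
    return 0.5 * (1.0 - np.cos(np.pi * t))


def purify_latents(encoded, alphas, rng):
    """Return z_i = (1 - alpha_i) * z^e_i + alpha_i * z^s_i for every code i."""
    alphas = np.asarray(alphas, dtype=float)
    if len(encoded) != len(alphas):
        raise ValueError("need exactly one alpha per latent code")
    if np.any(alphas < 0.0) or np.any(alphas > 1.0):
        raise ValueError("every alpha must lie in [0, 1]")
    purified = []
    for z_e, alpha in zip(encoded, alphas):
        # Assumption: the prior is a standard normal with the same shape as z^e_i.
        z_s = rng.standard_normal(z_e.shape)
        purified.append((1.0 - alpha) * z_e + alpha * z_s)
    return purified


# Stand-ins for encoder outputs; a real pipeline would encode an image instead.
rng = np.random.default_rng(0)
encoded = [np.ones(4) for _ in range(3)]
purified = purify_latents(encoded, linear_alphas(3), rng)
```

With $\alpha_i = 0$ the encoded code is kept unchanged, and with $\alpha_i = 1$ it is fully replaced by a prior sample. Under the ordering assumption, a rising schedule therefore preserves the early, global codes and re-samples the late, local ones.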

Abstract

Attackers can deliberately perturb classifiers' input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose specific generators, termed Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploring the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Despite the lack of large models trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods and can be used as foundation models. Official code released at https://github.com/SerezD/gen_adversarial.

Paper Structure

This paper contains 25 sections, 5 equations, 19 figures, 3 tables, 1 algorithm.

Figures (19)

  • Figure 1: (a): Overview of the adversarial attack and purification mechanism. The attacker subtly perturbs a source image to alter its prediction label (top to center). Purification (bottom) seeks to correct adversarial examples to the right class without affecting clean samples. (b): A Multiple Latent Variable Generative Model (MLVGM) maps latent variables (or codes, here $\mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2$) sampled from a known prior distribution, to high-quality images. (c): Each code impacts the image differently, from global to local features. From left to right: original image and those generated by replacing variables $\mathbf{z}_0$ with $\mathbf{z}_0'$, $\mathbf{z}_1$ with $\mathbf{z}_1'$, and $\mathbf{z}_2$ with $\mathbf{z}_2'$, respectively.
  • Figure 2: The overall architecture of our framework, depicting the optional preprocessing phase (adding Gaussian noise or blurring the input image) and autoencoding with pre-trained MLVGMs, which do not require further training. Inside the central rectangle, we show the latent purification process that is the core of our method, consisting of three main steps: encoding, sampling and interpolation.
  • Figure 3: Analysis of $\alpha$ combinations on the Celeb-A HQ Gender task. (a) Spearman's index ($\rho$) vs accuracy for $512$ random combinations, obtaining a Pearson's linear correlation value of $0.267$. (b) Comparison of the $18$ final $\alpha$ values for the linear, cosine and learned combinations. (c) Attack success rates (the lower the better) for increasing $L_2$ bounds on each tested attack and combination.
  • Figure 4: Analysis of $\alpha$ combinations on the Celeb-A 64 Identities task. (a) Spearman's index ($\rho$) vs accuracy for $512$ random combinations, obtaining a Pearson's linear correlation value of $0.607$. (b) Comparison of the $24$ final $\alpha$ values for the linear, cosine and learned combinations. (c) Attack success rates (the lower the better) for increasing $L_2$ bounds on each tested attack and combination.
  • Figure 5: Analysis of $\alpha$ combinations on the Stanford Cars 128 task. (a) Spearman's index ($\rho$) vs accuracy for $512$ random combinations, obtaining a Pearson's linear correlation value of $0.211$. (b) Comparison of the $16$ final $\alpha$ values for the linear, cosine and learned combinations. (c) Attack success rates (the lower the better) for increasing $L_2$ bounds on each tested attack and combination.
  • ...and 14 more figures
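
The captions of Figures 3 to 5 do not spell out how Spearman's index is computed for an $\alpha$ combination. One plausible reading, assumed here, is that $\rho$ measures how monotonically the $\alpha$ values rise with the code index. The sketch below follows that reading; the function name `spearman_rho` is hypothetical, the uniform sampling of random combinations is an assumption, and the simple ranking used does not handle tied values.

```python
import numpy as np


def spearman_rho(alphas) -> float:
    """Spearman correlation between code index and alpha value (assumes no ties)."""
    alphas = np.asarray(alphas, dtype=float)
    index_ranks = np.arange(len(alphas))
    # Double argsort gives the 0-based rank of each alpha value.
    alpha_ranks = np.argsort(np.argsort(alphas))
    return float(np.corrcoef(index_ranks, alpha_ranks)[0, 1])


# 512 random combinations of 18 alphas, matching the counts in the Figure 3 caption.
# Assumption: each alpha is drawn uniformly from [0, 1].
rng = np.random.default_rng(0)
combos = rng.uniform(0.0, 1.0, size=(512, 18))
rhos = np.array([spearman_rho(c) for c in combos])
```

Given one measured accuracy per combination, the Pearson value reported in the captions would correspond to `np.corrcoef(rhos, accuracies)[0, 1]`. The accuracies must come from an actual evaluation and are deliberately not simulated here.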