Blending adversarial training and representation-conditional purification via aggregation improves adversarial robustness

Emanuele Ballarin, Alessio Ansuini, Luca Bortolussi

TL;DR

This work tackles adversarial robustness for image classification by merging adversarial training with representation-conditioned purification. It introduces CARSO, which leverages a classifier's internal activations to condition a stochastic purifier (a conditional VAE) that generates multiple clean reconstructions; the classifier then aggregates their predictions via a robust, normalised doubly-exponential logit product. Empirical results across CIFAR-10, CIFAR-100, and TinyImageNet-200 show substantial gains in end-to-end robustness against adaptive white-box attacks, at the cost of a modest drop in clean accuracy, outperforming both adversarial training and purification baselines under AutoAttack in most settings. The approach is end-to-end differentiable, respects gradient-based attack assumptions, and demonstrates a scalable path to integrating purification and adversarial training for improved reliability of visual systems.

Abstract

In this work, we propose CARSO, a novel adversarial defence mechanism for image classification that blends the paradigms of adversarial training and adversarial purification in a synergistic, robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such a distribution are classified by the same adversarially-trained model, and a carefully chosen aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation on a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. Paying a modest clean accuracy toll, our method improves by a significant margin the state of the art for CIFAR-10, CIFAR-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack. Code and instructions to obtain pre-trained models are available at: https://github.com/emaballarin/CARSO .

Paper Structure (45 sections, 4 equations, 1 figure, 19 tables)

This paper contains 45 sections, 4 equations, 1 figure, 19 tables.

Figures (1)

  • Figure 1: Schematic representation of the CARSO architecture used in the experimental phase of this work. The subnetwork bordered by the red dashed line is used only during the training of the purifier. The subnetwork bordered by the blue dashed line is re-evaluated on different random samples $\boldsymbol{z}_i$ and the resulting individual $\hat{y}_i$ are aggregated into $\hat{y}_{\text{rob}}$. The classifier $f(\cdot; \boldsymbol{\theta})$ is always kept frozen; the remaining network is trained on $\mathcal{L}_{\text{VAE}}(\boldsymbol{x}, \hat{\boldsymbol{x}})$. More precise details on the functioning of the networks are provided in \ref{ssec:archoverview}.
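The aggregation step in the caption above, where the individual $\hat{y}_i$ obtained from different samples $\boldsymbol{z}_i$ are combined into $\hat{y}_{\text{rob}}$, can be illustrated with a minimal sketch. It assumes one plausible reading of the "normalised doubly-exponential logit product" mentioned in the TL;DR: each class score is the product over samples of exp(exp(logit)), normalised across classes. The exact formula is not stated in this section, and the function names `aggregate_double_exp` and `predict_robust` are hypothetical, not taken from the CARSO repository.

```python
import math


def aggregate_double_exp(logits_per_sample):
    """Combine per-sample logits into one probability vector.

    logits_per_sample: list of N lists, each holding C classifier logits,
    one list per purified reconstruction (i.e. per random sample z_i).

    Assumed rule (not confirmed by the source text):
        score_c = prod_i exp(exp(logit_{i,c})), then normalise over classes.
    """
    n_classes = len(logits_per_sample[0])
    # Work in log space: log(score_c) = sum_i exp(logit_{i,c}).
    log_scores = [
        sum(math.exp(sample[c]) for sample in logits_per_sample)
        for c in range(n_classes)
    ]
    # Normalise with a max-shifted softmax so the outer exp cannot overflow.
    shift = max(log_scores)
    weights = [math.exp(s - shift) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def predict_robust(logits_per_sample):
    """Return the index of the class with the highest aggregated score."""
    probs = aggregate_double_exp(logits_per_sample)
    return max(range(len(probs)), key=probs.__getitem__)


# Illustrative logits for N = 3 reconstructions and C = 3 classes.
example_logits = [
    [2.0, 0.5, -1.0],
    [1.5, 1.0, -0.5],
    [0.2, 1.8, 0.0],
]
example_probs = aggregate_double_exp(example_logits)
example_class = predict_robust(example_logits)
```

Under this reading, a single confident reconstruction contributes much more to the final score than it would under plain logit averaging, since each logit is exponentiated before being summed in log space.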