Assessing Neural Network Robustness via Adversarial Pivotal Tuning

Peter Ebert Christensen; Vésteinn Snæbjarnarson; Andrea Dittadi; Serge Belongie; Sagie Benaim

TL;DR

Existing robustness benchmarks rely on minimal perturbations or on limited style, color, and attribute changes. This work instead uses a pretrained image generator to create highly expressive, class-preserving semantic manipulations of real images. The authors introduce Adversarial Pivotal Tuning (APT), which first performs latent inversion to obtain a pivot $w_p$ that reconstructs an input via the generator $G_s$, then fine-tunes the generator weights $\hat{\theta}$ to produce semantic edits that fool a pretrained classifier, while enforcing a perceptual bound $L_{pt} \le d$ to stay on the data manifold. Empirically, APT manipulations significantly reduce accuracy across multiple architectures and transfer between models, even for classifiers that are robust on standard benchmarks. Ablations show that each objective contributes to realism and fooling, and adversarial training with APT-generated images can improve robustness to this type of attack. The results indicate that current robustness benchmarks are insufficient against fully expressive generator-based manipulations, and that APT is useful both as an attack and as a data-augmentation strategy. Code is provided for replication.

Abstract

The robustness of image classifiers is essential to their deployment in the real world. The ability to assess this resilience to manipulations or deviations from the training data is thus crucial. These modifications have traditionally consisted of minimal changes that still manage to fool classifiers, and modern approaches are increasingly robust to them. Semantic manipulations that modify elements of an image in meaningful ways have thus gained traction for this purpose. However, they have primarily been limited to style, color, or attribute changes. While expressive, these manipulations do not make use of the full capabilities of a pretrained generative model. In this work, we aim to bridge this gap. We show how a pretrained image generator can be used to semantically manipulate images in a detailed, diverse, and photorealistic way while still preserving the class of the original image. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). Given an image, APT first finds a pivot latent space input that reconstructs the image using a pretrained generator. It then adjusts the generator's weights to create small yet semantic manipulations in order to fool a pretrained classifier. APT preserves the full expressive editing capabilities of the generative model. We demonstrate that APT is capable of a wide range of class-preserving semantic image manipulations that fool a variety of pretrained classifiers. Finally, we show that classifiers that are robust to other benchmarks are not robust to APT manipulations and suggest a method to improve them. Code available at: https://captaine.github.io/apt/
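The two-step procedure described in the abstract (invert to a pivot, then tune the generator's weights against a classifier) can be illustrated with a deliberately tiny toy. The sketch below is not the authors' implementation: the linear "generator" `A`, the linear classifier `c`, the L2 budget `d`, and the learning rate are all hypothetical stand-ins chosen so the example runs with NumPy alone, whereas the paper uses StyleGAN-XL, pretrained image classifiers, and a perceptual distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-ins, not the paper's models:
A = rng.normal(size=(8, 4))          # "generator" weights: x = A @ w
c = rng.normal(size=8)               # frozen linear classifier: label = sign(c @ x)
x_real = A @ rng.normal(size=4)      # input "image", chosen inside the generator's range

# Step 1 (inversion): find the pivot w_p with the generator frozen.
# Closed-form least squares stands in for iterative latent optimization.
w_p, *_ = np.linalg.lstsq(A, x_real, rcond=None)
x_rec = A @ w_p

# Step 2 (tuning): freeze w_p and update the generator weights to lower the
# classifier margin, stopping before the output drifts more than d from the input.
y = 1.0 if c @ x_rec > 0 else -1.0   # label the classifier assigns to the reconstruction
d = 0.2 * np.linalg.norm(x_real)     # assumed L2 budget; the paper bounds a perceptual distance
lr = 1e-3                            # assumed step size
A_tuned = A.copy()
for _ in range(10_000):
    # Gradient of the margin y * c @ (A @ w_p) with respect to A is y * outer(c, w_p).
    candidate = A_tuned - lr * y * np.outer(c, w_p)
    if np.linalg.norm(candidate @ w_p - x_real) > d:
        break                        # the next step would leave the budget
    A_tuned = candidate
    if y * (c @ (A_tuned @ w_p)) < 0:
        break                        # the classifier's prediction flipped

x_adv = A_tuned @ w_p
margin_before = y * (c @ x_rec)
margin_after = y * (c @ x_adv)
dist = np.linalg.norm(x_adv - x_real)
print(f"margin {margin_before:.3f} -> {margin_after:.3f}, distance {dist:.3f} (budget {d:.3f})")
```

Whether the toy classifier is actually fooled depends on the random draw and on the budget `d`; the point is only the structure: the latent is fixed after inversion, and all adversarial change comes from the generator's weights under a distance constraint.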

Paper Structure (8 sections, 5 equations, 5 figures, 6 tables)

Introduction
Related Work
Adversarial Pivotal Tuning
Experiments
Adversarially Generated Manipulations
Improving Robustness to Generated Samples
Ablation Study
Conclusion

Figures (5)

Figure 1: Generated manipulations. Row 1 shows the original images. Row 2 shows our manipulations (using a distance of 0.2). Row 3 shows the result of lin2020dual, using pixel-space adversarial manipulations applied to StyleGAN-XL's reconstructions. Row 4 shows the result of lin2020dual with latent-space manipulations applied using StyleGAN-XL. Row 5 applies our method using a diffusion-based generative model instead of StyleGAN-XL. Row 6 shows adversarially generated samples by song2018constructing using StyleGAN-XL, which are non-realistic and not class-preserving. Our method manipulates images in a non-trivial but class-preserving manner, using the full capacity of the pretrained StyleGAN-XL generator. For example, it removes the eye of the mantis (second column), changes the type of race car (third column), changes the color of the crab tail (fifth column), removes the text on an airship (seventh column), and removes some of the ropes (eighth column). All of these are class-preserving examples that fool a pretrained PRIME-ResNet50 PRIME2021 classifier. In contrast, lin2020dual generates either noisy and less realistic images (row 3) or images that differ significantly in semantics and do not preserve the input class (row 4).
Figure 2: The Adversarial Pivotal Tuning (APT) framework. In the first step, we optimize a style code $w_p$ using the standard latent optimization objective $\mathcal{L}_o$ from \ref{eq:inversion}, while keeping the generator $G$ frozen. The loss is computed between the ground-truth image $x_{gtr}$ and the generated image $x_{gen}$. In the second step, we freeze $w_p$ and fine-tune $G$ (shown in red) using the three objectives from \ref{eq:loss}: a reconstruction objective $\mathcal{L}_{rec}$, the projected GAN objective $\mathcal{L}_{PG}$ using the discriminator $D$, and our fooling objective $\mathcal{L}_{CE}$ using the classifier $C$. A $*$ indicates a frozen component.
Figure 3: Generated manipulations. The top row shows input images, the middle row shows APT manipulations for a ResNet-50 classifier, and the bottom row shows APT manipulations for a FAN-VIT classifier. The leftmost image of a dog and the subsequent images, including the butterfly and column 7 (fluffy dog), show similar manipulations for both classifiers. Columns 5-6 show texture and spatial manipulations, and the last column shows a fooling image without a clear APT manipulation.
Figure 5: Illustration of APT's attack as well as the combination of APT with noise-based attacks (PDF+SSAH).
Figure 6: APT generations for various values of the distance cutoff $d$, for a PRIME-ResNet50 classifier. The leftmost image shows the input image. The following images use a maximum distance $d$ of $0.2$, $0.3$, and $0.4$, respectively.
