Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation

Tianyi Chu, Wei Xing, Jiafu Chen, Zhizhong Wang, Jiakai Sun, Lei Zhao, Haibo Chen, Huaizhong Lin

TL;DR

This work re-examines conditional image generation tasks from the perspective of adversarial attack and proposes a simple and efficient plug-in projected gradient descent (PGD)-like method, which opens the door to applying adversarial attacks to low-level vision tasks.

Abstract

Existing generative adversarial network (GAN) based conditional image generative models typically produce a fixed output for the same conditional input, which is unreasonable for highly subjective tasks such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining or fine-tuning the network, or designing complex noise injection functions, which makes them computationally expensive, task-specific, or unable to generate high-quality results. Given that many deterministic conditional image generative models already produce high-quality yet fixed results, we raise an intriguing question: can pre-trained deterministic conditional image generative models generate diverse results without any change to their network structures or parameters? To answer this question, we re-examine conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative model by adding a micro perturbation to the input condition. In this way, diverse results can be generated without adjusting the network structure or fine-tuning the pre-trained model. In addition, we can control the generated results by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attacks to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
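
The key idea in the abstract (perturbing the input condition of a frozen deterministic model with a PGD-like loop) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy `generator` (a fixed `tanh(W x)` map standing in for a pre-trained network), the squared-distance objective, the random start, the L-infinity bound `eps`, the step size `alpha`, and the `target` argument (a plain output-space target, in place of the text or image guidance the abstract mentions).

```python
import math
import random

def generator(W, x):
    """Toy deterministic 'generator' y = tanh(W x); a stand-in for a frozen pre-trained model."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def grad_sq_dist(W, x_in, ref):
    """Analytic gradient of ||generator(W, x_in) - ref||^2 with respect to x_in."""
    y = generator(W, x_in)
    coeff = [2.0 * (yi - ri) * (1.0 - yi * yi) for yi, ri in zip(y, ref)]
    return [sum(c * row[j] for c, row in zip(coeff, W)) for j in range(len(x_in))]

def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def pgd_like_attack(W, x, target=None, eps=0.05, alpha=0.01, steps=20, seed=0):
    """Return a perturbation delta with |delta_i| <= eps; the weights W are never modified."""
    rng = random.Random(seed)
    # Untargeted: move away from the default output. Targeted: move toward `target`.
    ref = generator(W, x) if target is None else target
    direction = 1.0 if target is None else -1.0
    # Random start inside the eps-ball (the untargeted gradient is zero at delta = 0).
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = grad_sq_dist(W, x_adv, ref)
        # Signed gradient step, then projection (clipping) back onto the L-infinity ball.
        delta = [max(-eps, min(eps, di + direction * alpha * sign(gi)))
                 for di, gi in zip(delta, g)]
    return delta

W = [[0.9, -0.4, 0.2], [0.1, 0.8, -0.5], [-0.3, 0.2, 0.7]]  # frozen toy weights (hypothetical)
x = [0.5, -0.2, 0.1]                                        # input condition (hypothetical)
delta = pgd_like_attack(W, x, seed=0)
y_default = generator(W, x)
y_attacked = generator(W, [xi + di for xi, di in zip(x, delta)])
```

In this sketch, changing `seed` changes the random start and therefore possibly the final perturbation, while `W` stays untouched throughout. A real model would replace the analytic gradient with automatic differentiation, and this toy example says nothing about the quality or diversity obtained on actual images.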

Paper Structure (15 sections, 6 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Three mainstream methods for introducing diversity in conditional image generation: (a) injecting noise through a modulation module, (b) Gibbs sampling on feature sequences, and (c) transforming features according to specific rules. Our proposed method (d) allows pre-trained deterministic generative models to generate diverse results without multiple-step sampling, sophisticated transformation functions, or any adjustments to the network structure or parameters.
  • Figure 2: Diverse results generated by our method in conditional image generation tasks. We tested two pre-trained deterministic models: LaMa for image inpainting and StyTr$^{2}$ for style transfer. "Random noise" refers to adding standard Gaussian noise to the input. "Untargeted" refers to defining the attack direction to be as different from the default generated results as possible. "+" refers to specifying the attack direction via text or a reference image. (Zoom in for details.)
  • Figure 3: Diverse face inpainting results generated by attacking the deterministic inpainting model LaMa. Generated results are compared with the diverse inpainting models PIC and MAT.
  • Figure 4: Our method works well on super-resolution (a well-posed vision task). The SwinIR x4 model produces sharper results when attacked using "detailed" as the direction.
  • Figure 5: Left: Targeted diverse stylization. Top row: content image, style image, and the default stylized result of StyTr$^{2}$. Second and third rows: text-guided stylization, compared with CLIPstyler, which also uses CLIP for guidance. The default stylized image in row one is used as the input of CLIPstyler. Our method faithfully preserves the color characteristics of the style image. CLIPstyler requires fine-tuning the reconstruction model, which takes several minutes, while our method completes each attack step within 0.2 seconds. Right: Untargeted diverse stylization. Compared with DivSwapper and $\epsilon$-AE, our method generates higher diversity with better quality.
  • ...and 2 more figures