Enhancing Diffusion-Based Image Synthesis with Robust Classifier Guidance

Bahjat Kawar, Roy Ganz, Michael Elad

TL;DR

This paper addresses the unreliable gradients that standard classifiers provide for classifier guidance in diffusion models. It introduces a time-dependent, adversarially robust classifier whose gradients are perceptually aligned. The authors train the robust classifier with PGD-based adversarial attacks and early stopping, then use it to guide diffusion sampling on ImageNet, reporting improved FID and precision and a human preference for the robust-guided outputs. They validate the approach with experiments, ablations, and an opinion survey, which show both quantitative gains and rater-perceived improvements over vanilla guidance. The work highlights the importance of gradient quality in guidance signals and opens avenues for robust, perceptually informed conditioning in diffusion and related generative frameworks.
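
The PGD-based attack mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy logistic-regression classifier, the L-infinity perturbation budget, and all names and parameter values (`pgd_attack`, `epsilon`, `step_size`, `num_steps`) are assumptions chosen for illustration, and the toy classifier ignores the diffusion timestep that the paper's classifier conditions on.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, b, x, y):
    """Cross-entropy loss of a toy logistic classifier and its gradient w.r.t. the input x.

    w, x are lists of floats, b is a float, y is a label in {0, 1}.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return loss, grad_x


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x, y, epsilon=0.5, step_size=0.1, num_steps=10):
    """Projected gradient descent attack inside an L-infinity ball (assumed threat model).

    Each step moves the input in the direction that increases the loss,
    then projects it back so that |x_adv[i] - x[i]| <= epsilon.
    """
    x_adv = list(x)
    for _ in range(num_steps):
        _, grad = loss_and_grad(w, b, x_adv, y)
        # Ascent step on the loss using the sign of the input gradient.
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection back onto the epsilon-ball around the clean input.
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon)
                 for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.0   # hypothetical classifier parameters
    x, y = [0.5, -0.5], 1     # hypothetical clean input and label
    x_adv = pgd_attack(w, b, x, y)
    print("clean loss:", loss_and_grad(w, b, x, y)[0])
    print("adversarial loss:", loss_and_grad(w, b, x_adv, y)[0])
```

In adversarial training, the classifier's parameters would then be updated on `x_adv` instead of `x`, which is what encourages gradients that remain meaningful under small perturbations.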

Abstract

Denoising diffusion probabilistic models (DDPMs) are a recent family of generative models that achieve state-of-the-art results. In order to obtain class-conditional generation, it was suggested to guide the diffusion process by gradients from a time-dependent classifier. While the idea is theoretically sound, deep learning-based classifiers are infamously susceptible to gradient-based adversarial attacks. Therefore, while traditional classifiers may achieve good accuracy scores, their gradients are possibly unreliable and might hinder the improvement of the generation results. Recent work discovered that adversarially robust classifiers exhibit gradients that are aligned with human perception, and these could better guide a generative process towards semantically meaningful images. We utilize this observation by defining and training a time-dependent adversarially robust classifier and use it as guidance for a generative diffusion model. In experiments on the highly challenging and diverse ImageNet dataset, our scheme introduces significantly more intelligible intermediate gradients, better alignment with theoretical findings, as well as improved generation results under several evaluation metrics. Furthermore, we conduct an opinion survey whose findings indicate that human raters prefer our method's results.
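
For reference, the guided sampling step that the abstract alludes to is commonly written as below. This is the standard formulation of classifier guidance, given here as a sketch; the symbols are assumptions, since this excerpt does not reproduce the paper's own equations. Here $\mu_\theta$ and $\Sigma_\theta$ denote the mean and covariance predicted by the diffusion model, $p_\phi(y \mid x_t, t)$ is the time-dependent classifier, and $s$ is a guidance scale.

```latex
\hat{\mu}_\theta(x_t, t) = \mu_\theta(x_t, t)
    + s \, \Sigma_\theta(x_t, t) \, \nabla_{x_t} \log p_\phi(y \mid x_t, t),
\qquad
x_{t-1} \sim \mathcal{N}\!\left( \hat{\mu}_\theta(x_t, t), \; \Sigma_\theta(x_t, t) \right)
```

Under this formulation, swapping the vanilla classifier for a robust one changes only $p_\phi$, so the quality of the gradient term directly determines the direction in which each sampling step is pushed.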

Paper Structure (22 sections, 8 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Images generated with our proposed method.
  • Figure 2: Images generated by guided diffusion using the same random seed and class label, with a vanilla (top) and a robust (bottom) classifier. Our robust model provides more informative gradients, leading to better synthesis quality.
  • Figure 3: Gradients of images on their respective true class labels, using a vanilla classifier and our robust one at different timesteps. Gradients are min-max normalized.
  • Figure 4: Maximizing the probability of target classes with given images using classifier gradients (at $t=0$). Our robust classifier leads to images with less adversarial noise, and more aligned with the target class.
  • Figure 5: Approximations of the final image at uniformly spaced intermediate steps of the guided diffusion process, for the same class and the same random seed. Our robust classifier provides better guidance.
  • ...and 3 more figures
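
The min-max normalization mentioned in the Figure 3 caption can be sketched as follows. The function name and the flat-list representation are assumptions for illustration (a real gradient map would be a multi-dimensional array), and the handling of a constant input is one possible convention, not something the source specifies.

```python
def min_max_normalize(values):
    """Rescale a flat list of gradient values linearly into [0, 1] for display."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: no contrast to show, so map everything to 0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical gradient values for three pixels.
    print(min_max_normalize([-2.0, 0.0, 2.0]))
```

Because each gradient map is rescaled by its own minimum and maximum, the visualization shows the spatial structure of a gradient rather than its absolute magnitude.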