DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks

Yinqi Li, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen

TL;DR

This work addresses the challenge of performing discriminative tasks with pretrained diffusion models by inverting a conditional diffusion model conditioned on object layouts. It introduces a prior layout model and an optimization-based inversion in embedding space to compute the posterior $p(y|x) \propto p(x|y) p(y)$, enabling object detection without fine-tuning the generator. Empirically, the method (DIVE) achieves competitive object detection performance on COCO compared with basic discriminative detectors and substantially speeds up image classification compared to enumeration-based diffusion classifiers. The approach highlights the intrinsic discriminative capacity of pretrained generative models and suggests practical pathways for applying diffusion models to dense recognition tasks, with potential extensions to faster inversion and other dense tasks such as semantic segmentation.
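Written in log space, the Bayes' rule relation above gives a simple decision rule. The following is a sketch rather than the paper's exact objective: $\hat{y}$ and the loss arguments $(x, y)$ are notation introduced here, and the second line assumes that the training losses $L_{\text{diffusion}}$ and $L_{\text{prior}}$ are usable stand-ins for the negative log-likelihoods of $p(x|y)$ and $p(y)$.

$$
\begin{aligned}
\hat{y} &= \arg\max_y \log p(y|x) = \arg\max_y \big[\log p(x|y) + \log p(y)\big] \\
&\approx \arg\min_y \big[L_{\text{diffusion}}(x, y) + L_{\text{prior}}(y)\big]
\end{aligned}
$$

The term $\log p(x)$ is dropped because it does not depend on $y$.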

Abstract

Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach that replaces the heavy prediction enumeration process, and a prior distribution model that makes more accurate use of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE .
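To make the contrast between enumeration and gradient-based inversion concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption: the 2-D vocabulary `VOCAB`, the quadratic `toy_loss` standing in for the conditional denoising loss, and the "optimize continuously, then project to the nearest vocabulary entry" scheme. It is not the DIVE implementation, whose details are in the paper and the linked repository.

```python
import math

# Hypothetical 2-D "vocabulary" of label embeddings (illustrative only).
VOCAB = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "car": (-1.0, 0.0),
    "bus": (0.0, -1.0),
}


def toy_loss(e, target):
    """Stand-in for the conditional denoising loss: squared distance to a target."""
    return (e[0] - target[0]) ** 2 + (e[1] - target[1]) ** 2


def toy_grad(e, target):
    """Analytic gradient of toy_loss with respect to the embedding e."""
    return (2.0 * (e[0] - target[0]), 2.0 * (e[1] - target[1]))


def classify_by_enumeration(target):
    """Evaluate the loss once per candidate label and keep the best one."""
    return min(VOCAB, key=lambda name: toy_loss(VOCAB[name], target))


def nearest_label(e):
    """Project a continuous embedding onto the closest vocabulary entry."""
    return min(VOCAB, key=lambda name: math.dist(VOCAB[name], e))


def classify_by_inversion(target, steps=50, lr=0.1):
    """Gradient descent on a continuous embedding, then a discrete projection."""
    e = (0.0, 0.0)
    for _ in range(steps):
        g = toy_grad(e, target)
        e = (e[0] - lr * g[0], e[1] - lr * g[1])
    return nearest_label(e)


target = (0.1, 0.9)  # hypothetical embedding implied by the input image
print(classify_by_enumeration(target), classify_by_inversion(target))
```

In this toy the vocabulary is tiny, so enumeration is the cheaper option. The inversion route becomes attractive when the label space is too large to enumerate, as with object layouts, because the number of loss evaluations depends on the number of optimization steps rather than on the number of candidates.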

Paper Structure

This paper contains 39 sections, 7 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Repurposing pretrained conditional diffusion models for discriminative tasks without tuning model parameters. $L_{\text{diffusion}}$ and $L_{\text{prior}}$ denote the diffusion training losses (Ho et al., 2020) of the two models, which serve as variational bounds on the log-likelihoods of the modeled distributions $p(x|y)$ and $p(y)$, respectively.
  • Figure 2: Illustration of the training framework of layout-conditional image generation model (left) and prior layout model (right). Bounding boxes in image $x$ are for visualization only.
  • Figure 3: Using trained layout-conditional image generation model and the prior layout model for object detection.
  • Figure 4: Visualization of DIVE detection results, with the optimization step at which each image converged shown below it, as reported by the convergence monitor introduced in the method implementation section. For simplicity, the maximum number of optimization steps is fixed at 2000 for all images. The average convergence step over the test set is about 960. Images are fed to the networks at $256\times256$ resolution and shown here in their original aspect ratios for better visualization.
  • Figure 5: Visualization of the object detection results. Besides comparing DIVE with other generative baselines that use the same pretrained diffusion model, we also show the influence of the prior model and of the in-vocabulary discrete optimization method. For these ablations, we show at the bottom some additional dropped objects (none-value-contained and illegal boxes) from the inverted sequence, to make the behavior of each method clearer. On the right, we show images generated by feeding the full inverted sequence to the pretrained layout-to-image model. Zoom in for better visualization.