DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
Yinqi Li, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
TL;DR
This work addresses the challenge of performing discriminative tasks with pretrained diffusion models by inverting a diffusion model conditioned on object layouts. It introduces a prior model over layouts and an optimization-based inversion in embedding space to compute the posterior $p(y|x) \propto p(x|y) p(y)$, enabling object detection without fine-tuning the generator. Empirically, the method (DIVE) achieves object detection performance on COCO that is competitive with basic discriminative detectors, and it is substantially faster at image classification than enumeration-based diffusion classifiers. The approach highlights the intrinsic discriminative capacity of pretrained generative models and suggests practical pathways for applying diffusion models to dense recognition tasks. Potential extensions include faster inversion and other dense tasks such as semantic segmentation.
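As a hedged illustration of the Bayes'-rule decision referred to above, the prediction can be written as a maximum a posteriori estimate. The second line uses the approximation common in diffusion-based classifiers, where the conditional log-likelihood is replaced by a negative denoising error (a variational bound, up to weighting and constants). The symbols $\epsilon_\theta$, $x_t$, and $t$ are notation assumed here for the noise predictor, the noised input, and the timestep; DIVE's exact objective may differ.

```latex
\begin{aligned}
\hat{y} &= \arg\max_{y}\, \big[\log p(x \mid y) + \log p(y)\big], \\
\log p(x \mid y) &\approx -\,\mathbb{E}_{t,\epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, y) \rVert^2\big] + \text{const}.
\end{aligned}
```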
Abstract
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for discriminative tasks. Specifically, we extend the discriminative capability of pretrained, frozen generative diffusion models from classification to the more complex task of object detection by "inverting" a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach that replaces the costly enumeration of candidate predictions, and a prior distribution model that enables a more accurate application of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method greatly speeds up the previous diffusion-based classification method without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE .
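To make the contrast between enumeration and gradient-based inversion concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation. The linear `toy_eps_pred` is a hypothetical stand-in for a frozen conditional denoiser, `ALPHA_BAR` is an assumed single-timestep noise level, and all function names are invented for illustration. The inversion routine optimizes a continuous condition embedding and then projects it onto the nearest discrete class embedding. For brevity it ignores the prior, which only the enumeration routine uses.

```python
import math

ALPHA_BAR = 0.5  # assumed noise-schedule value at one fixed timestep


def toy_eps_pred(x_t, c):
    """Hypothetical stand-in for a frozen conditional noise predictor."""
    a, s = math.sqrt(ALPHA_BAR), math.sqrt(1.0 - ALPHA_BAR)
    return [(xt - a * ci) / s for xt, ci in zip(x_t, c)]


def denoising_loss(x, c, eps):
    """Squared noise-prediction error for one noise draw; lower means c explains x better."""
    a, s = math.sqrt(ALPHA_BAR), math.sqrt(1.0 - ALPHA_BAR)
    x_t = [a * xi + s * ei for xi, ei in zip(x, eps)]  # forward-noised input
    pred = toy_eps_pred(x_t, c)
    return sum((ei - pi) ** 2 for ei, pi in zip(eps, pred))


def classify_by_enumeration(x, embeddings, log_prior, eps):
    """Score every candidate label: one denoiser call per class."""
    scores = [-denoising_loss(x, e, eps) + lp
              for e, lp in zip(embeddings, log_prior)]
    return max(range(len(scores)), key=scores.__getitem__)


def classify_by_inversion(x, embeddings, eps, steps=50, lr=0.1, h=1e-4):
    """Gradient descent on a continuous embedding, then nearest-class projection."""
    c = [0.0] * len(x)
    for _ in range(steps):
        grad = []
        for i in range(len(c)):
            # Central finite difference; a real model would use autograd instead.
            c_plus, c_minus = c[:], c[:]
            c_plus[i] += h
            c_minus[i] -= h
            grad.append((denoising_loss(x, c_plus, eps)
                         - denoising_loss(x, c_minus, eps)) / (2 * h))
        c = [ci - lr * gi for ci, gi in zip(c, grad)]
    # Discrete step: snap the optimized embedding to the closest class embedding.
    dists = [sum((ci - ei) ** 2 for ci, ei in zip(c, e)) for e in embeddings]
    return min(range(len(dists)), key=dists.__getitem__)


# Toy usage: three classes with one-hot embeddings, input closest to class 1.
EMBEDDINGS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
X = [0.1, 0.9, 0.05]
EPS = [0.3, -0.2, 0.5]                # one fixed noise draw
LOG_PRIOR = [math.log(1.0 / 3.0)] * 3  # uniform prior over labels

print(classify_by_enumeration(X, EMBEDDINGS, LOG_PRIOR, EPS))
print(classify_by_inversion(X, EMBEDDINGS, EPS))
```

In this toy setting both routines select the same class. The practical difference is cost: enumeration calls the denoiser once per candidate label, which becomes infeasible when the label space is a combinatorial set of object layouts, whereas the number of optimization steps in the inversion does not depend on the size of the label space.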
