Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Ziqi Pang, Xin Xu, Yu-Xiong Wang

TL;DR

This work addresses the mismatch between generative diffusion denoising and discriminative visual perception tasks. It introduces ADDP, which combines three components (contribution-aware timestep sampling, diffusion-tailored data augmentation, and correctional guidance interaction) to align the diffusion process with perception objectives. Across depth estimation, referring image segmentation (RIS), and generalist perception, ADDP reaches state-of-the-art results or results close to those of discriminative models without architectural changes, and demonstrates the practical benefit of interactive prompts for multi-round reasoning. The results establish the diffusion denoising process as a controllable, interactive interface for perception, with broad applicability and potential for agentic workflows.

Abstract

With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.
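
Finding (1) can be illustrated with a minimal sketch of non-uniform timestep sampling during training. This is not the authors' implementation: the power-law weighting, the `gamma` parameter, and the function names `timestep_weights` and `sample_timesteps` are hypothetical choices made only for illustration. The sketch assumes the DDPM convention in which a larger `t` means more noise, so the earliest denoising steps correspond to the largest `t`.

```python
import random
from typing import List, Optional


def timestep_weights(num_steps: int, gamma: float = 2.0) -> List[float]:
    """Hypothetical weighting: sampling probability grows with noise level t.

    t = num_steps - 1 is the noisiest timestep, i.e. the earliest step of the
    reverse (denoising) process. gamma = 0 recovers uniform sampling.
    """
    raw = [((t + 1) / num_steps) ** gamma for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_timesteps(
    batch_size: int,
    num_steps: int = 1000,
    gamma: float = 2.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw training timesteps, favoring those used early in denoising."""
    rng = rng or random.Random()
    weights = timestep_weights(num_steps, gamma)
    return rng.choices(range(num_steps), weights=weights, k=batch_size)


if __name__ == "__main__":
    batch = sample_timesteps(8, rng=random.Random(0))
    print(batch)  # mostly large t values, i.e. high-noise timesteps
```

In a training loop, the sampled `t` values would replace the usual uniform draw before noise is added to the target. The form of the weighting actually used by ADDP is defined in the paper and the linked code, not here.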

Paper Structure

This paper contains 39 sections, 9 equations, 13 figures, 13 tables, 2 algorithms.

Figures (13)

  • Figure 1: We demonstrate the gaps between a generative denoising process and perception tasks using referring image segmentation (RIS), where the diffusion model learns to color the referred object with a red mask. (a)(b) The perception quality (Intersection-over-Union, IoU) at intermediate denoising steps, all taken from the same denoising trajectory, reveals the uneven contribution of timesteps and the training-denoising distribution shift, which we address with our enhanced learning objective and training data. (c) We discover that the generative denoising process is also a unique user interface for discriminative perception, because of its ability to interact with correctional guidance from users or foundation models.
  • Figure 2: Method overview. We align generative diffusion models with perception tasks in terms of the learning objective, training data, and user interface. Notation follows DDPM (Ho et al., 2020).
  • Figure 3: Evolution of $\delta_1$ (intuitively the "accuracy" for depth estimation) from Marigold (Ke et al., 2023) shows smoother patterns than RIS. We copy the RIS curve from Fig. 1a here for easier comparison.
  • Figure 4: Data augmentation of (a) Gaussian blurring for depth estimation, and (b) color / shape / location for RIS. We use large / small intensities of augmentations to simulate different scales of distribution shifts at the earlier / later steps of denoising.
  • Figure 5: Interacting with correctional guidance $D^-$.
  • ...and 8 more figures
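
The captions above track two perception metrics along the denoising trajectory: IoU for RIS and $\delta_1$ for depth estimation. The sketch below shows how these are commonly computed, assuming the standard definitions (IoU as intersection over union of binary masks; $\delta_1$ as the fraction of valid pixels with max(pred/gt, gt/pred) < 1.25). The function names, the plain-list inputs, and the handling of empty masks and invalid depths are illustrative assumptions, not taken from the paper's code.

```python
from typing import List, Sequence


def iou(pred_mask: Sequence[Sequence[int]], gt_mask: Sequence[Sequence[int]]) -> float:
    """Intersection-over-Union of two same-shaped binary masks (rows of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def delta1(pred_depth: List[float], gt_depth: List[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels whose depth ratio is within `threshold`."""
    hits = 0
    valid = 0
    for p, g in zip(pred_depth, gt_depth):
        if p <= 0 or g <= 0:
            continue  # assumption: non-positive depths are treated as invalid
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    return hits / valid if valid else 0.0


if __name__ == "__main__":
    print(iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))      # 1 shared pixel out of 3
    print(delta1([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))     # 2 of 3 pixels within 1.25
```

Evaluating such a metric on the intermediate prediction at each denoising step is one way to obtain per-step curves like those described for Figures 1 and 3.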