Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Ziqi Pang; Xin Xu; Yu-Xiong Wang

TL;DR

This work addresses the mismatch between the generative diffusion denoising process and discriminative visual perception tasks. It introduces ADDP, which combines three components (contribution-aware timestep sampling, diffusion-tailored data augmentation, and interactive correctional guidance) to align the diffusion process with perception objectives. On depth estimation, referring image segmentation (RIS), and generalist perception, ADDP reaches state-of-the-art accuracy, or accuracy close to that of discriminative models, without architectural changes. It also demonstrates the practical benefit of interactive prompts for multi-round reasoning. The results position the diffusion denoising process as a controllable, interactive interface for perception, with potential use in agentic workflows.

Abstract

With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.
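Finding (1) suggests drawing training timesteps non-uniformly, so that the steps which matter most for perception are seen more often. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the common convention that larger t means more noise, so the earliest denoising steps are the largest t. The power-law weighting and the names `timestep_weights`, `sample_timesteps`, and `power` are hypothetical choices made for illustration only.

```python
import random


def timestep_weights(num_steps, power=1.0):
    """Unnormalized sampling weights that grow with the noise level t.

    Assumption: t = num_steps - 1 is the noisiest step, i.e. the first
    step visited during denoising. The power law is illustrative only.
    """
    return [(t + 1) ** power for t in range(num_steps)]


def sample_timesteps(batch_size, num_steps, power=1.0, rng=None):
    """Draw training timesteps, favoring early (high-noise) denoising steps.

    power = 0 recovers the usual uniform timestep sampling.
    """
    rng = rng or random.Random()
    weights = timestep_weights(num_steps, power)
    return rng.choices(range(num_steps), weights=weights, k=batch_size)
```

In a training loop, the returned timesteps would stand in for the uniformly drawn ones when noising each batch. How strongly to skew the distribution is a tuning choice that the abstract does not specify.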
