Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang

TL;DR

This work introduces Diff-2-in-1, a unified diffusion-based framework that bridges multi-modal data generation and dense visual perception within a single model. By exploiting the diffusion-denoising process and a novel self-improving mechanism with two interplaying parameter sets, it generates faithful, diverse RGB-attribute data while enhancing discriminative tasks. The approach yields consistent gains across diverse backbones and tasks, and demonstrates data-efficient improvements via synthetic data generation and refinement. Overall, Diff-2-in-1 offers a versatile, data-efficient pathway to jointly advance generative and discriminative capabilities in dense visual understanding.

Abstract

Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.

Paper Structure

This paper contains 24 sections, 10 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: A single, unified diffusion-based model for both generative and discriminative learning. If the model receives an RGB image as input, its function is to predict an accurate visual attribute map. Simultaneously, the model is equipped to produce photo-realistic and coherent multi-modal data sampled from Gaussian noise. We use depth as an example here for illustration, and the framework is also applicable to other visual attributes such as segmentation, surface normal, etc.
  • Figure 2: Our self-improving learning paradigm with two interplaying sets of parameters during training. The data creation parameter $\theta_\text{C}$ generates samples that serve as additional training data for the data exploitation parameter $\theta_\text{E}$, while $\theta_\text{E}$ performs discriminative learning and guides the update of $\theta_\text{C}$ through an exponential moving average. At inference, $\theta_\text{C}$ performs both the discriminative and the generative tasks.
  • Figure 3: Real data samples from NYUv2 and synthesized samples generated from Gaussian noise. The distribution of the generated data differs from the real data distribution.
  • Figure 4: In-distribution data generation using partial noise. We generate in-distribution data by denoising from a noisy image at timestep $T$ with $0<T<T_\mathrm{max}$. A larger $T$ leads to greater diversity, whereas a smaller $T$ enhances the resemblance to the original distribution.
  • Figure 5: Ablation study on different data settings with our Diff-2-in-1. Green line: performance of the baseline VPD. Yellow line: performance with our Diff-2-in-1. Gray bars: improvement in each data setting. Diff-2-in-1 consistently improves performance across all data settings, with larger gains in the mid-range data settings.
  • ...and 9 more figures
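
The captions of Figures 2 and 4 describe two mechanisms that a short sketch may make more concrete: the exponential moving average (EMA) that lets $\theta_\text{E}$ guide $\theta_\text{C}$, and the partial noising of a real image up to a timestep $T$ with $0<T<T_\mathrm{max}$. The code below is only an illustration, not the authors' implementation. The function names, the momentum value, the flat-list representation of parameters and images, and the linear DDPM-style noise schedule are all assumptions made for this example; the page does not state which schedule or momentum the paper uses.

```python
import math
import random


def ema_update(theta_c, theta_e, momentum=0.999):
    """EMA step: pull the creation parameters toward the exploitation parameters.

    Assumed form: theta_C <- m * theta_C + (1 - m) * theta_E.
    The momentum value is a hypothetical default.
    """
    return [momentum * c + (1.0 - momentum) * e for c, e in zip(theta_c, theta_e)]


def linear_alpha_bar(t, t_max, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t.

    Uses a linear beta schedule (an assumption borrowed from DDPM).
    Returns 1.0 for t = 0, i.e. no noise has been added yet.
    """
    alpha_bar = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (t_max - 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar


def add_partial_noise(x0, t, t_max, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar) * x_0, (1 - alpha_bar) * I).

    A larger t keeps less of x0 (more diversity after denoising);
    a smaller t keeps more of x0 (closer to the original distribution).
    """
    abar = linear_alpha_bar(t, t_max)
    signal, noise = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [0.5, -0.5, 0.25]  # toy stand-in for a flattened image
    for t in (100, 500, 900):
        kept = math.sqrt(linear_alpha_bar(t, 1000))
        print(f"T={t}: signal scale {kept:.3f}", add_partial_noise(image, t, 1000, rng))
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

In this sketch the noised sample would then be passed to the denoising network, which is omitted here. The signal scale printed for each $T$ shows the trade-off named in the Figure 4 caption: as $T$ grows, less of the original image survives, so the denoised result can drift further from it.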