A Simple Approach to Unifying Diffusion-based Conditional Generation

Xirui Li, Charles Herrmann, Kelvin C. K. Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang

TL;DR

The paper unifies diffusion-based conditional generation tasks that depend on the correlation between an image and a paired condition by learning their joint distribution $p(\mathbf{x},\mathbf{y})$ with a simple, parameter-efficient two-branch diffusion framework. It introduces UniCon, which adapts a pretrained diffusion model with joint cross-attention modules and LoRA adapters. A single model then supports several inference modes, including controllable generation, estimation, and joint generation, from one training stage that adds about 15% extra parameters, and it can handle non-aligned or coarse conditioning. Empirical results show that UniCon performs comparably to specialized methods and better than previous unified approaches across multiple modalities (e.g., depth, edges, pose, identity). Separately trained models can also be combined for multi-signal conditioning, albeit with some instability for loosely correlated pairs. The work shows that large-scale diffusion models can be adapted with simple, flexible training and sampling strategies to unify diverse conditional generation tasks.

Abstract

Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g., image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g., depth to image), estimation (e.g., image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce significant complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simple formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce results comparable to those of specialized methods and better than those of prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.
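To make the idea of "different inference-time sampling schemes" concrete, the sketch below shows one plausible way a single joint model could be steered into different tasks: by choosing which of the two inputs is kept clean and which is denoised. This is an illustrative assumption, not the authors' code. The function name `branch_timesteps`, the mode names, and the evenly spaced schedule are all hypothetical, and the sketch only covers three of the listed capabilities.

```python
import numpy as np

def branch_timesteps(mode, num_steps=5, t_max=1000):
    """Return one (t_image, t_condition) pair per sampling step.

    Hypothetical helper: a timestep of 0 means "this input is given and
    stays clean"; a decreasing timestep means "this input is being denoised".
    """
    if mode not in ("joint", "cond_to_image", "image_to_cond"):
        raise ValueError(f"unknown mode: {mode}")

    # Evenly spaced, decreasing noise levels from t_max - 1 down to 0 (assumed schedule).
    noisy = np.linspace(t_max - 1, 0, num_steps).round().astype(int).tolist()
    clean = [0] * num_steps

    # Joint generation denoises both inputs; the other modes hold one input clean.
    t_image = clean if mode == "image_to_cond" else noisy
    t_cond = clean if mode == "cond_to_image" else noisy
    return list(zip(t_image, t_cond))

if __name__ == "__main__":
    for mode in ("joint", "cond_to_image", "image_to_cond"):
        print(mode, branch_timesteps(mode))
```

Under this reading, depth-to-image generation and image-to-depth estimation differ only in which branch receives the clean input, which is why one trained model could serve both.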

Paper Structure

This paper contains 18 sections, 8 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: The proposed UniCon supports diverse generation behavior in one model for a targeted type of image and condition. UniCon also offers flexible conditional generation ability with natural support for free-form input and seamless integration of multiple models.
  • Figure 2: UniCon pipeline. Given a pair of image-condition inputs, our UniCon model processes them concurrently in two parallel branches, with injected joint cross-attention modules where features from two branches attend to each other. We use LoRA weights to adapt our model from a pretrained diffusion model. During training, we separately sample timesteps for each input and compute loss over both branches.
  • Figure 2: Quantitative depth estimation comparison. We compare MiDaS (Ranftl et al., 2020), DPT (Ranftl et al., 2021), Marigold (Ke et al., 2023), and our Depth-Metric model on zero-shot depth estimation benchmarks. We show results without test-time ensembling.
  • Figure 3: Qualitative comparison of diverse Image-Depth generation tasks. We compare our single UniCon-Depth model with other specialized methods and a previous unified method, JointNet (Zhang et al., 2023), on diverse generation tasks.
  • Figure 3: Ablation of training setting and model alternatives. We evaluate the conditional generation performance of our Depth model under different settings.
  • ...and 10 more figures
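
The pipeline caption above (two parallel branches, joint cross-attention, separately sampled timesteps, loss over both branches) can be sketched in a few lines of NumPy. This is a minimal toy under stated assumptions, not the paper's implementation: the single-head attention, the residual connection, the linear noise schedule, and the use of the attention layer as a stand-in for each denoising branch are all hypothetical simplifications, and LoRA adaptation is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats, w_q, w_k, w_v):
    """Single-head attention: queries from one branch, keys/values from the other."""
    q, k, v = q_feats @ w_q, kv_feats @ w_k, kv_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

def joint_cross_attention(feat_x, feat_y, params):
    """Each branch attends to the other; outputs are added residually (assumed)."""
    out_x = feat_x + cross_attend(feat_x, feat_y, *params["x_from_y"])
    out_y = feat_y + cross_attend(feat_y, feat_x, *params["y_from_x"])
    return out_x, out_y

def add_noise(clean, t, t_max=1000):
    """Toy linear schedule (assumption): signal weight falls from 1 at t=0 toward 0."""
    alpha_bar = 1.0 - t / t_max
    noise = rng.standard_normal(clean.shape)
    return np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise, noise

def training_loss(x, y, params, t_max=1000):
    # Timesteps are sampled separately for the image and the condition.
    t_x, t_y = rng.integers(0, t_max, size=2)
    x_t, eps_x = add_noise(x, t_x, t_max)
    y_t, eps_y = add_noise(y, t_y, t_max)
    # Stand-in for the two denoising branches: one joint attention layer.
    pred_x, pred_y = joint_cross_attention(x_t, y_t, params)
    # Noise-prediction loss summed over both branches.
    return np.mean((pred_x - eps_x) ** 2) + np.mean((pred_y - eps_y) ** 2)

dim = 8
params = {
    name: tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
    for name in ("x_from_y", "y_from_x")
}
x = rng.standard_normal((16, dim))  # 16 image tokens
y = rng.standard_normal((10, dim))  # 10 condition tokens; counts need not match
loss = training_loss(x, y, params)
```

Because attention operates over token sets rather than pixel-aligned grids, the two branches in this toy can have different token counts, which hints at why such a design could accommodate conditions that are not spatially aligned with the image.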