A Simple Approach to Unifying Diffusion-based Conditional Generation
Xirui Li, Charles Herrmann, Kelvin C. K. Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang
TL;DR
The paper unifies diffusion-based conditional generation tasks that depend on the correlation between a pair of images by learning their joint distribution $p(\mathbf{x},\mathbf{y})$ with a simple, parameter-efficient two-branch diffusion framework. It introduces UniCon, which adapts a diffusion model with joint cross-attention and LoRA adapters so that a single model supports several inference modes, including controllable generation, estimation, and joint generation. Training takes a single stage and adds about 15% extra parameters relative to the base model, and the model can handle non-aligned or coarse conditioning. Empirically, UniCon performs comparably to specialized methods and better than previous unified approaches across multiple modalities (e.g., depth, edges, pose, identity). It also supports multi-signal conditioning by combining models, albeit with some instability for loosely correlated pairs. The work shows that large-scale diffusion models can be adapted, with simple training and flexible sampling strategies, to unify diverse conditional generation tasks in one efficient multi-task system.
Abstract
Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g., image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g., depth to image), estimation (e.g., image to depth), signal guidance, joint generation (image and depth), and coarse control. Previous attempts at unification often introduce significant complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simple formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities such as non-spatially aligned and coarse conditioning. Extensive results show that our single model produces results comparable to those of specialized methods and better than those of prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.
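
To make the idea of "different inference-time sampling schemes" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each branch of the joint model (image $\mathbf{x}$ and condition $\mathbf{y}$) can be given its own noise level, so that holding one branch clean while denoising the other turns the joint model into a conditional one. The function names, the mode names, and the linear noise schedule are all hypothetical and chosen only for illustration; signal guidance and coarse control are omitted because the text above does not specify how they are sampled.

```python
# Illustrative sketch only: maps an inference mode to per-branch noise levels.
# ASSUMPTION: each branch (image x, condition y) has an independent noise
# level in [0, 1], where 1.0 means pure noise and 0.0 means a clean input.
# None of these names come from the paper.

def linear_levels(num_steps, start=1.0):
    """Noise levels that fall linearly from `start` toward 0 over num_steps."""
    return tuple(start * (1 - i / num_steps) for i in range(num_steps))


def constant_levels(num_steps, value):
    """The same noise level at every sampling step."""
    return tuple(value for _ in range(num_steps))


def schedules_for(mode, num_steps):
    """Return (x_levels, y_levels) for a hypothetical two-branch sampler."""
    denoise = linear_levels(num_steps)        # branch is being generated
    clean = constant_levels(num_steps, 0.0)   # branch is given and kept fixed

    if mode == "joint":          # sample image and condition together
        return denoise, denoise
    if mode == "controllable":   # condition given, generate the image
        return denoise, clean
    if mode == "estimation":     # image given, estimate the condition
        return clean, denoise
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("joint", "controllable", "estimation"):
        x_levels, y_levels = schedules_for(mode, num_steps=4)
        print(mode, "x:", x_levels, "y:", y_levels)
```

Under these assumptions, the three modes differ only in which branch is treated as known, which is one way a single jointly trained model could serve depth-to-image, image-to-depth, and joint generation without retraining.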
