Adding Additional Control to One-Step Diffusion with Joint Distribution Matching
Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang
TL;DR
Joint Distribution Matching (JDM) addresses the problem of adding new controls to a one-step diffusion model without modifying and redistilling the base model. It does so by minimizing the reverse KL divergence between image-condition joint distributions. A tractable upper bound on this objective decouples fidelity learning from condition learning, which enables asymmetric teacher-student distillation: the student can handle controls unknown to the teacher and can use decoupled CFG or human feedback learning (HFL). Empirically, JDM with a single sampling step outperforms multi-step controllable diffusion baselines in most cases, cutting sampling cost (e.g., from 50 NFEs to 1), and it achieves state-of-the-art results in one-step text-to-image synthesis when combined with enhanced CFG or HFL. The approach also supports sharing one generator across controls via a two-phase warm-up, and it models the conditional distributions efficiently using a fake score and a Consistency Model with LoRA, offering a practical, scalable path to adaptable, high-quality controllable diffusion.
Abstract
While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to new controls, such as novel structural constraints or the latest user preferences, remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it, a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM, using only one sampling step, surpasses baseline methods such as multi-step ControlNet in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
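To give some intuition for why a joint objective can separate into a fidelity part and a condition part, the sketch below numerically checks the standard chain rule for KL divergence on a toy discrete joint over images x and conditions c: KL(q(x,c) || p(x,c)) = KL(q(x) || p(x)) + E_{q(x)}[KL(q(c|x) || p(c|x))]. This is a generic identity, not the upper bound derived in the paper, and the role of q as the student distribution and p as the target is an assumption made for illustration. The function names (`kl`, `normalize`, `decompose_joint_kl`) and the random 4x3 tables are hypothetical.

```python
import math
import random


def kl(q, p):
    """KL(q || p) for two discrete distributions given as equal-length lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def normalize(table):
    """Scale a table of positive weights so that all entries sum to 1."""
    total = sum(sum(row) for row in table)
    return [[v / total for v in row] for row in table]


def decompose_joint_kl(q_joint, p_joint):
    """Split KL(q(x,c) || p(x,c)) into a marginal and a conditional term.

    Rows index images x, columns index conditions c.
    Returns (fidelity, condition), where
      fidelity  = KL(q(x) || p(x))
      condition = E_{q(x)}[ KL(q(c|x) || p(c|x)) ]
    """
    q_x = [sum(row) for row in q_joint]
    p_x = [sum(row) for row in p_joint]
    fidelity = kl(q_x, p_x)
    condition = 0.0
    for q_row, p_row, qx, px in zip(q_joint, p_joint, q_x, p_x):
        q_c = [v / qx for v in q_row]  # q(c | x)
        p_c = [v / px for v in p_row]  # p(c | x)
        condition += qx * kl(q_c, p_c)
    return fidelity, condition


# Toy joints: assumed stand-ins for the student (q) and target (p) distributions.
random.seed(0)
q_joint = normalize([[random.random() + 0.1 for _ in range(3)] for _ in range(4)])
p_joint = normalize([[random.random() + 0.1 for _ in range(3)] for _ in range(4)])

total = kl([v for row in q_joint for v in row], [v for row in p_joint for v in row])
fidelity, condition = decompose_joint_kl(q_joint, p_joint)
print(f"joint={total:.6f} fidelity={fidelity:.6f} condition={condition:.6f}")
```

In this toy setting the two terms sum to the joint KL exactly, which is one way to read the claim that matching the image marginal and matching the condition given the image can be treated as separate learning signals.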
