Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang

TL;DR

Joint Distribution Matching (JDM) tackles the challenge of injecting new controls into one-step diffusion without retraining the base model by minimizing the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning, enabling asymmetric teacher-student distillation where the student can handle controls unknown to the teacher and leverage decoupled CFG or human feedback learning. Empirically, JDM outperforms multi-step controllable diffusion baselines with only one-step generation and achieves state-of-the-art results in one-step text-to-image synthesis when using enhanced CFG or HFL, while significantly reducing computational cost (e.g., from 50 NFEs to 1). The approach also supports shared generators across controls via a two-phase warm-up and employs a fake score and Consistency Model with LoRA to efficiently model conditional distributions, offering a practical, scalable path for adaptable, high-quality controllable diffusion.
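
To make the decoupling concrete, the following is a sketch based on the standard chain rule for KL divergence, not the paper's own statement of the bound. The symbols $q(x, c)$ for the student's image-condition joint distribution and $p(x, c)$ for the target joint distribution are notation assumed here for illustration, and $H$ denotes Shannon entropy.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(q(x,c)\,\|\,p(x,c)\big)
&= D_{\mathrm{KL}}\big(q(x)\,\|\,p(x)\big)
 + \mathbb{E}_{q(x)}\Big[D_{\mathrm{KL}}\big(q(c\mid x)\,\|\,p(c\mid x)\big)\Big] \\
&= D_{\mathrm{KL}}\big(q(x)\,\|\,p(x)\big)
 - \mathbb{E}_{q(x)}\big[H\big(q(c\mid x)\big)\big]
 - \mathbb{E}_{q(x,c)}\big[\log p(c\mid x)\big] \\
&\le D_{\mathrm{KL}}\big(q(x)\,\|\,p(x)\big)
 - \mathbb{E}_{q(x,c)}\big[\log p(c\mid x)\big]
 \qquad \text{(for discrete } c,\ H \ge 0\text{)}.
\end{aligned}
```

Under these assumptions, the first term depends only on the image marginal (fidelity) and the second only on how well a generated image agrees with its condition, which is one way to read the decoupling described above.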

Abstract

While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or the latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM, with only a single generation step, surpasses baseline methods such as multi-step ControlNet in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
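
As a small numerical illustration of why a joint reverse KL can be bounded by a fidelity term plus a condition term, the sketch below checks the identity KL(q(x,c) || p(x,c)) = KL(q(x) || p(x)) + E[-log p(c|x)] - E[H(q(c|x))] on random discrete tables. It is a toy check under assumed notation, not the paper's training objective: the function names `kl` and `joint_kl_terms`, the table sizes, and the random distributions are all hypothetical.

```python
import numpy as np


def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions of equal shape."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))


def joint_kl_terms(q_joint, p_joint):
    """Split the joint KL into fidelity, condition, and entropy parts.

    Rows index images x, columns index discrete conditions c.
    Returns (joint, fidelity, condition, entropy) where
    joint == fidelity + condition - entropy.
    """
    q_x = q_joint.sum(axis=1)              # marginal q(x)
    p_x = p_joint.sum(axis=1)              # marginal p(x)
    q_c_given_x = q_joint / q_x[:, None]   # q(c | x)
    p_c_given_x = p_joint / p_x[:, None]   # p(c | x)

    joint = kl(q_joint, p_joint)
    fidelity = kl(q_x, p_x)
    # Expected negative log-likelihood of the condition under the target.
    condition = float(-np.sum(q_joint * np.log(p_c_given_x)))
    # Expected conditional entropy of c given x under q (non-negative).
    entropy = float(-np.sum(q_joint * np.log(q_c_given_x)))
    return joint, fidelity, condition, entropy


# Hypothetical toy setup: 5 "images" and 3 discrete "conditions".
rng = np.random.default_rng(0)
q_joint = rng.random((5, 3)) + 0.1   # offset keeps every entry positive
q_joint /= q_joint.sum()
p_joint = rng.random((5, 3)) + 0.1
p_joint /= p_joint.sum()

joint, fidelity, condition, entropy = joint_kl_terms(q_joint, p_joint)
upper_bound = fidelity + condition
print(f"joint KL = {joint:.4f}, upper bound = {upper_bound:.4f}")
```

Because the entropy term is non-negative for discrete conditions, dropping it can only increase the right-hand side, so `upper_bound` is never smaller than `joint` in this toy setting.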

Paper Structure

This paper contains 16 sections, 1 theorem, 12 equations, 6 figures, 3 tables.

Key Result

Lemma 3.1

Suppose the condition $c$ is discrete. Then an upper bound of the joint KL divergence (eq:joint_kl) can be computed by:

Figures (6)

  • Figure 1: Visual comparison of different strategies for adding controls. The compared baselines include 1) the diffusion model with an integrated standard ControlNet (denoted as ControlNet), and 2) the integration of a pre-trained standard ControlNet with Diff-Instruct's pre-trained one-step generator (denoted as DI + ControlNet). Notably, our method not only maintains computational efficiency but also surpasses the visual quality achieved by the standard ControlNet approach. The standard ControlNet approach relies heavily on high Classifier-Free Guidance (CFG) to achieve high-quality generation, and this dependency might introduce unwanted artifacts in the final samples.
  • Figure 2: The framework description of our proposed JDM.
  • Figure 3: The qualitative comparison of the proposed method and potential baselines in one-step controllable generation.
  • Figure 4: Qualitative comparisons on controllable generation across different control signals against competing methods.
  • Figure 5: Qualitative comparisons on text-to-image generation across different control signals against competing methods.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Lemma 3.1