ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen

TL;DR

This work tackles precise controllability in text-to-image diffusion by introducing ControlNet++, which explicitly optimizes pixel-level cycle consistency between input conditions and generated images using a pre-trained discriminative reward model. It pairs this objective with an efficient reward fine-tuning strategy that adds noise to the input images and computes the consistency loss on single-step denoised images, avoiding the time and memory costs of storing gradients across multiple sampling timesteps. Across segmentation, edge, and depth controls, the method improves controllability metrics (e.g., mIoU) while preserving image quality and text alignment (FID, CLIP). The approach is validated through experiments and ablations, with code and data open-sourced, and offers a practical framework for explicit controllability via cycle-consistency feedback in diffusion models.

Abstract

To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet of 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively. All the code, models, demo, and organized data have been open-sourced in our GitHub repository.
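The efficient reward strategy can be illustrated with a minimal sketch. It assumes the standard DDPM parameterization, which is not spelled out in the text above: the noised input is x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, and the single-step estimate is x'_0 = (x_t - sqrt(1 - alpha_bar) * eps_pred) / sqrt(alpha_bar). The names `add_noise`, `predict_x0`, and the fixed-offset "denoiser" are hypothetical stand-ins for illustration, not the authors' code.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(xt, eps_pred, alpha_bar):
    """Single-step estimate of the clean image from a predicted noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, eps_pred)]


def mse(u, v):
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # stand-in for a training image
eps = [rng.gauss(0.0, 1.0) for _ in range(64)]    # noise added on purpose

# Hypothetical imperfect denoiser: its noise estimate is off by a fixed amount.
eps_pred = [e + 0.1 for e in eps]

errors = {}
for alpha_bar in (0.99, 0.5, 0.05):  # alpha_bar near 1 corresponds to a small timestep
    xt = add_noise(x0, eps, alpha_bar)
    x0_hat = predict_x0(xt, eps_pred, alpha_bar)
    errors[alpha_bar] = mse(x0, x0_hat)
    print(f"alpha_bar={alpha_bar:.2f}  mse(x0, x0')={errors[alpha_bar]:.6f}")
```

Under these assumptions the estimate needs only one denoiser call, so a reward loss computed on x'_0 backpropagates through a single step rather than a full sampling chain. The error grows as alpha_bar shrinks, which suggests such a strategy is most reliable at small noise levels.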

Paper Structure (40 sections, 11 equations, 16 figures, 8 tables)

Figures (16)

  • Figure 1: (a) Given the same input image condition and text prompt, (b) the extracted conditions of our generated images are more consistent with the inputs, (c,d) while other methods fail to achieve accurate controllable generation. SSIM scores measure the similarity between all input edge conditions and the extracted edge conditions. All the line edges are extracted by the same line detection model used by ControlNet.
  • Figure 1: Illustration of predicted image $x'_0$ at different timesteps $t$. A small timestep $t$ (i.e., small noise $\epsilon_t$) leads to more precise estimation $x'_0 \approx x_0$.
  • Figure 2: Illustration of the cycle consistency. We first prompt the diffusion model $\mathbb{G}$ to generate an image $x'_0$ based on the given image condition $c_v$ and text prompt $c_t$, then extract the corresponding image condition $\hat{c}_v$ from the generated image $x'_0$ using pre-trained discriminative models $\mathbb{D}$. The cycle consistency is defined as the similarity between the extracted condition $\hat{c}_v$ and input condition $c_v$.
  • Figure 2: Naively increasing the weight of the image condition embedding relative to the text condition embedding in existing methods (i.e., ControlNet and T2I-Adapter) cannot improve controllability while ensuring image quality. The red boxes in the figures highlight areas where the generated image is inconsistent with the input conditions. Please note that we employ the same line detection model to extract conditions from images.
  • Figure 3: (a) Existing methods achieve implicit controllability by introducing image-based conditional control $c_v$ into the denoising process of diffusion models, with the guidance of latent-space denoising loss. (b) We utilize discriminative reward models $\mathbb{D}$ to explicitly optimize the controllability of $\mathbb{G}$ via pixel-level cycle consistency loss.
  • ...and 11 more figures
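
The cycle consistency described in the Figure 2 caption (generate an image from $c_v$, extract $\hat{c}_v$ from it, then compare the two) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_generator` and `toy_extractor` are hypothetical stand-ins for $\mathbb{G}$ and $\mathbb{D}$, the "image" is a flat list of values, and mIoU is used as the similarity measure for a segmentation-style condition.

```python
def toy_generator(mask, ignore_every=0):
    """Hypothetical stand-in for G: paints each pixel with its label value.

    When ignore_every > 0, every ignore_every-th pixel disregards the
    condition and is drawn as class 0 instead.
    """
    image = []
    for i, label in enumerate(mask):
        if ignore_every and i % ignore_every == 0:
            image.append(0.0)
        else:
            image.append(float(label))
    return image


def toy_extractor(image):
    """Hypothetical stand-in for D: recovers one label per pixel by rounding."""
    return [int(round(v)) for v in image]


def mean_iou(target, predicted, num_classes):
    """Mean intersection-over-union across classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(target, predicted) if t == c and p == c)
        union = sum(1 for t, p in zip(target, predicted) if t == c or p == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


c_v = [0, 0, 1, 1, 2, 2, 1, 0]  # input condition (toy segmentation mask)

# Cycle: condition -> generated image -> extracted condition -> similarity.
faithful = mean_iou(c_v, toy_extractor(toy_generator(c_v)), 3)
sloppy = mean_iou(c_v, toy_extractor(toy_generator(c_v, ignore_every=2)), 3)

print(f"faithful generator mIoU: {faithful:.3f}")
print(f"sloppy generator mIoU:   {sloppy:.3f}")
```

A generator that follows the condition scores a perfect consistency, while one that ignores part of the condition scores lower. A consistency loss of the kind the paper describes would penalize that gap during fine-tuning.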