CPO: Condition Preference Optimization for Controllable Image Generation

Zonglin Lyu, Ming Li, Xinxin Liu, Chen Chen

TL;DR

This work introduces Condition Preference Optimization (CPO), a low-variance training objective for controllable diffusion-based image generation that learns from control signals rather than image-level winners. By optimizing preferences over control conditions $(\mathbf{c}^w,\mathbf{c}^l)$, CPO disentangles controllability from image quality, enabling training across arbitrary timesteps and reducing data-curation overhead. The method yields state-of-the-art controllability across tasks such as segmentation, human pose, and edge/depth maps while maintaining competitive FID/CLIP scores. Empirically, CPO achieves significant improvements over ControlNet++ with scalable data curation and favorable training efficiency, highlighting its practical impact for fine-grained controllable image synthesis.

Abstract

To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^{w}$) over less controllable ones ($I^{l}$). However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^{w}$ and $\mathbf{c}^{l}$, and train the model to prefer $\mathbf{c}^{w}$. This method, which we term Condition Preference Optimization (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over $10\%$ error rate reduction in segmentation, $70$-$80\%$ in human pose, and consistent $2$-$5\%$ reductions in edge and depth maps.
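To make the idea of preferring $\mathbf{c}^{w}$ over $\mathbf{c}^{l}$ concrete, the following is a minimal sketch of a condition-level preference loss. It is not the paper's exact objective, which is not reproduced in this section. It assumes a Diffusion-DPO-style logistic loss on noise-prediction errors, with the same noisy input and the same noise target shared by both conditions. The names `preference_loss`, `cpo_loss`, `model`, `ref_model`, and `beta` are hypothetical, and the denoisers are stand-ins that return plain lists of floats.

```python
import math


def sq_err(pred, target):
    """Squared L2 error between a predicted and a target noise vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def preference_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """Logistic preference loss on denoising errors (assumed form).

    Computes -log(sigmoid(-z)) with
    z = beta * ((err_w - ref_err_w) - (err_l - ref_err_l)),
    written as a numerically stable softplus(z).
    """
    z = beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def cpo_loss(model, ref_model, x_t, t, noise, c_w, c_l, beta=1.0):
    """Condition-level preference: one noisy image x_t, two conditions.

    `model` and `ref_model` are hypothetical callables (x_t, cond, t) -> noise
    prediction. Only the condition differs between the winning and losing
    branches, so image-level factors are shared by construction.
    """
    err_w = sq_err(model(x_t, c_w, t), noise)
    err_l = sq_err(model(x_t, c_l, t), noise)
    ref_w = sq_err(ref_model(x_t, c_w, t), noise)
    ref_l = sq_err(ref_model(x_t, c_l, t), noise)
    return preference_loss(err_w, err_l, ref_w, ref_l, beta)
```

In an image-level DPO setup the two branches would instead use two different noisy images, $I^{w}$ and $I^{l}$, under one condition, which is where differences unrelated to controllability can enter the comparison. Because the timestep `t` is just an input here, nothing in this sketch restricts training to low-noise timesteps.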

Paper Structure

This paper contains 17 sections, 28 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: (a) Data generation process of DPO. We generate 20 images and select the ones best and worst aligned with the input condition. The example is kept only if the ImageReward (IR) score of the winner is 0.2 higher; otherwise, it is filtered out. (b) Data generation process of CPO. Our process is much simpler: we generate one image to perturb the ground-truth condition, since the generation model is not perfectly controllable. (c) Examples on Pose and Canny. The $\mathbf{c}^w$ are the ground-truth conditions. In the Pose example, the red circle marks an artifact in the winning example of DPO. In the Canny example, the difference in Canny edges is too hard to discern from raw pixels in DPO, whereas our method directly compares conditions.
  • Figure 2: Illustration of ControlNet++.
  • Figure 3: (a) Training of DPO. DPO is trained to prefer $I^w$ over $I^l$. (b) Training of CPO. CPO is trained to prefer $\mathbf{c}^w$ over $\mathbf{c}^l$.
  • Figure 4: Qualitative Comparison in Controllability. Red boxes indicate the area where our method achieves better controllability.
  • Figure 5: Qualitative Comparison in Pose and Lineart.
  • ...and 13 more figures
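
The two data-generation procedures described in the Figure 1 caption can be contrasted with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline. It assumes the IR filter means the winner's score exceeds the loser's by at least 0.2, and that the losing condition in CPO is the condition re-extracted from the single generated image. All names (`curate_dpo_pair`, `curate_cpo_pair`, `align_error`, `image_reward`, `generate`, `extract_condition`) are hypothetical placeholders for a generator, a condition detector, and a reward model.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Cond = TypeVar("Cond")


def curate_dpo_pair(
    images: Sequence[Image],
    cond: Cond,
    align_error: Callable[[Image, Cond], float],
    image_reward: Callable[[Image], float],
    min_ir_gap: float = 0.2,
) -> Optional[Tuple[Image, Image]]:
    """Image-level pair: best vs. worst aligned among the candidates.

    The caption uses 20 candidate images per condition. The pair is dropped
    (None) when the winner's reward lead is below `min_ir_gap` (assumed reading).
    """
    ranked = sorted(images, key=lambda im: align_error(im, cond))
    winner, loser = ranked[0], ranked[-1]
    if image_reward(winner) - image_reward(loser) < min_ir_gap:
        return None
    return winner, loser


def curate_cpo_pair(
    generate: Callable[[str, Cond], Image],
    extract_condition: Callable[[Image], Cond],
    cond_gt: Cond,
    prompt: str,
) -> Tuple[Cond, Cond]:
    """Condition-level pair: ground truth vs. a perturbed condition.

    A single generation is enough: since the generator does not follow the
    control signal perfectly, the condition detected in its output differs
    from the ground truth and serves as the losing condition (assumption).
    """
    image = generate(prompt, cond_gt)
    cond_lose = extract_condition(image)
    return cond_gt, cond_lose
```

Under these assumptions, the image-level route needs many generations, a ranking pass, and a reward-model pass per example, while the condition-level route needs one generation and one detector pass, which matches the caption's remark that the CPO process is much simpler.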