CPO: Condition Preference Optimization for Controllable Image Generation

Zonglin Lyu; Ming Li; Xinxin Liu; Chen Chen

TL;DR

This work introduces Condition Preference Optimization (CPO), a low-variance training objective for controllable diffusion-based image generation that learns from control signals rather than image-level winners. By optimizing preferences over control conditions $(\boldsymbol{c}^w,\boldsymbol{c}^l)$, CPO disentangles controllability from image quality, enabling training across arbitrary timesteps and reducing data-curation overhead. The method yields state-of-the-art controllability across tasks such as segmentation, human pose, and edge/depth maps while maintaining competitive FID/CLIP scores. Empirically, CPO achieves significant improvements over ControlNet++ with scalable data curation and favorable training efficiency, highlighting its practical impact for fine-grained controllable image synthesis.

Abstract

To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., $t < 200$) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images ($I^{w}$) over less controllable ones ($I^{l}$). However, due to uncertainty in generative models, it is difficult to ensure that win--lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images. Specifically, we construct winning and losing control signals, $\mathbf{c}^{w}$ and $\mathbf{c}^{l}$, and train the model to prefer $\mathbf{c}^{w}$. This method, which we term \textit{Condition Preference Optimization} (CPO), eliminates confounding factors and yields a low-variance training objective. Our approach theoretically exhibits lower contrastive loss variance than DPO and empirically achieves superior results. Moreover, CPO requires less computation and storage for dataset curation. Extensive experiments show that CPO significantly improves controllability over the state-of-the-art ControlNet++ across multiple control types: over $10\%$ error rate reduction in segmentation, $70$--$80\%$ in human pose, and consistent $2$--$5\%$ reductions in edge and depth maps.
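The idea of preferring $\mathbf{c}^{w}$ over $\mathbf{c}^{l}$ can be illustrated with a minimal sketch. The abstract does not state the exact CPO objective, so the form below is an assumption: it borrows the familiar DPO-style log-sigmoid contrast and applies it to per-sample denoising errors computed for the same noisy image under the two conditions. The function name `condition_preference_loss`, its arguments, and the `beta` scale are hypothetical and chosen only for illustration.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def condition_preference_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """Hypothetical DPO-style loss over a (c^w, c^l) condition pair.

    err_w, err_l:         per-sample denoising errors of the trained model
                          when conditioned on c^w and c^l (assumed inputs).
    ref_err_w, ref_err_l: the same errors under a frozen reference model.
    beta:                 assumed scale on the preference margin.
    """
    err_w, err_l, ref_err_w, ref_err_l = (
        np.asarray(a, dtype=float) for a in (err_w, err_l, ref_err_w, ref_err_l)
    )
    # The margin grows when the trained model improves under c^w
    # relative to the reference more than it does under c^l.
    margin = (err_l - ref_err_l) - (err_w - ref_err_w)
    return float(np.mean(-log_sigmoid(beta * margin)))


if __name__ == "__main__":
    # Toy numbers, not results from the paper.
    print(condition_preference_loss([0.5], [1.5], [1.0], [1.0]))
```

Because both terms in such a sketch share one image and one noise sample, the only thing that varies inside the contrast is the control condition, which is the intuition behind the abstract's claim that confounding factors are removed.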
