Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models
Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin
TL;DR
This work introduces Direct Consistency Optimization (DCO), a fine-tuning objective for robustly customizing text-to-image diffusion models from only a few reference images. DCO regularizes training to minimize deviation from the pretrained model in latent space, so the model learns the new concept while preserving prior knowledge. It can be paired with consistency-guided sampling to balance subject fidelity against image-text alignment. The method reaches a superior Pareto frontier compared with DreamBooth and prior preservation, improves both subject and style fidelity (including in 1-shot scenarios), and supports merging independently trained subject and style models. Because it applies to both subject and style personalization and improves compositional generation, it enables personalized, controllable image synthesis without requiring additional data.
Abstract
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. However, such fine-tuned models are not robust; they often fail to compose with concepts from the pretrained model or with other fine-tuned models. To address this, we propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between the fine-tuned and pretrained models to retain the pretrained knowledge during fine-tuning. Through extensive experiments on subject and style customization, we demonstrate that our method positions itself on a superior Pareto frontier between subject (or style) consistency and image-text alignment over all previous baselines; it not only outperforms the regular fine-tuning objective in image-text alignment, but also shows higher fidelity to the reference images than the method that fine-tunes with an additional prior dataset. More importantly, the models fine-tuned with our method can be merged without interference, allowing us to generate custom subjects in a custom style by composing separately customized subject and style models. Notably, we show that our approach achieves better prompt fidelity and subject fidelity than models post-optimized for merging regular fine-tuned models.
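To make the idea of "controlling the deviation between the fine-tuned and pretrained models" concrete, the following is a minimal sketch of one plausible form such an objective could take. The abstract does not state the loss, so everything here is an assumption for illustration: the function name `dco_style_loss`, the temperature `beta`, and the choice of a log-sigmoid over the gap between the two models' denoising errors. Timestep weighting and the actual diffusion networks are omitted; noise predictions are plain lists of floats.

```python
import math


def squared_error(pred, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def dco_style_loss(eps_true, eps_finetuned, eps_pretrained, beta=1.0):
    """Illustrative (assumed) consistency-style objective.

    eps_true       : noise actually added to the reference image latent
    eps_finetuned  : noise predicted by the model being fine-tuned
    eps_pretrained : noise predicted by the frozen pretrained model
    beta           : hypothetical temperature controlling how strongly
                     deviation from the pretrained model is penalized
    """
    # Negative gap means the fine-tuned model denoises the reference
    # better than the pretrained model does.
    gap = squared_error(eps_finetuned, eps_true) - squared_error(eps_pretrained, eps_true)
    z = beta * gap
    # -log(sigmoid(-z)) equals softplus(z); computed in a numerically stable way.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    eps_true = [1.0, 0.0]
    eps_pretrained = [0.0, 0.0]
    # Loss when the fine-tuned model has not moved from the pretrained one.
    print(dco_style_loss(eps_true, eps_pretrained, eps_pretrained))
    # Loss when the fine-tuned model predicts the true noise exactly.
    print(dco_style_loss(eps_true, eps_true, eps_pretrained))
```

In this sketch the loss depends only on how much the fine-tuned model improves over the pretrained one on the reference images, and the gradient shrinks as that improvement grows. That saturation is one way a relative objective can discourage drifting far from the pretrained model, in contrast to a plain denoising loss that keeps pushing the error toward zero.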
