Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin

TL;DR

This work introduces Direct Consistency Optimization (DCO) to robustly fine-tune text-to-image diffusion models using only a few reference images. By regularizing the learning process to minimize deviation from the pretrained model in latent space, DCO preserves prior knowledge while enabling new concept learning, and it can be paired with consistency-guided sampling to balance subject fidelity and image-text alignment. The method achieves superior Pareto frontiers compared with DreamBooth and prior preservation, supports merging of independently trained subject and style models, and improves both subject and style fidelity, including in 1-shot scenarios. The approach generalizes to both subject and style personalization and improves compositional generation, which has practical implications for personalized, controllable image synthesis without requiring additional data.

Abstract

Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. However, such fine-tuned models are not robust; they often fail to compose with concepts of the pretrained model or of other fine-tuned models. To address this, we propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between the fine-tuned and pretrained models to retain the pretrained knowledge during fine-tuning. Through extensive experiments on subject and style customization, we demonstrate that our method positions itself on a superior Pareto frontier between subject (or style) consistency and image-text alignment over all previous baselines; it not only outperforms the regular fine-tuning objective in image-text alignment, but also shows higher fidelity to the reference images than the method that fine-tunes with an additional prior dataset. More importantly, the models fine-tuned with our method can be merged without interference, allowing us to generate custom subjects in a custom style by composing separately customized subject and style models. Notably, we show that our approach achieves better prompt fidelity and subject fidelity than methods that post-optimize for merging regularly fine-tuned models.

Paper Structure (36 sections, 22 equations, 22 figures, 3 tables, 2 algorithms)

Figures (22)

  • Figure 1: Overview. (a) Direct Consistency Optimization (DCO) pushes the Pareto frontier between prompt fidelity and subject fidelity toward the upper right compared with DreamBooth (Ruiz et al., 2023) and DreamBooth with prior preservation loss (DreamBooth+p.p.). DCO improves generation of custom subjects with various visual attributes (e.g., astronaut outfits and a Mars background) or in various styles that the pretrained model knows (e.g., flat cartoon illustration style). (b) The customized subject and style models fine-tuned with DCO can be merged as is, allowing us to generate "my subject in my style" (Sohn et al., 2023).
  • Figure 2: Comprehensive caption. We provide examples of the compact caption (Ruiz et al., 2023) and our comprehensive caption (top row), and generated images from each method (bottom row). The model fine-tuned with the compact caption (left) generates images of a dog sitting on a couch even though the prompt asks for the dog to be on the lake. Our comprehensive caption (right) effectively disentangles unwanted attributes, generating images that follow text prompts more faithfully.
  • Figure 3: Custom subject generation. We show selected generations from DreamBooth (DB), DB with prior preservation (DB+p.p.), and ours (DCO) of custom subjects with varying attributes and styles guided by text prompts. While DB captures subjects well, it does not follow the text prompt well. DB+p.p. shows better textual alignment, but falls short in subject fidelity. Ours performs best in both image-text alignment and subject fidelity. Best viewed in color, zoomed in on a monitor.
  • Figure 4: Custom style generation. We show selected generations from DreamBooth (DB) and ours (DCO) of custom styles with varying subjects. DB is prone to capturing undesirable attributes, resulting in generation of mixed concepts (e.g., the girl's outfits in the first row, the dog in the second row), whereas DCO mitigates such concept mixing. Best viewed in color, zoomed in on a monitor.
  • Figure 5: Quantitative results. We plot the Pareto curve between subject / style fidelity (image similarity) and prompt fidelity (image-text similarity) on (a) subject personalization and (b) style personalization tasks. The scores of each point are measured with consistency guidance sampling (dots and lines) at $\omega_{\textrm{con}}=2.0, 3.0, 4.0, 5.0$, and with conventional classifier-free guidance sampling (diamond). See Sec. \ref{sec:exp_subject} and Sec. \ref{sec:exp_style} for experimental details, and Appendix \ref{appendix:addexp_full} for the full comparison.
  • ...and 17 more figures
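
The Pareto-frontier comparison described in the Figure 1 and Figure 5 captions can be illustrated with a minimal sketch. It assumes each model configuration is summarized by a pair of scores (image similarity, image-text similarity), both higher-is-better, and keeps only the configurations that no other configuration matches or beats on both axes. The function name `pareto_front` and the numbers in `scores` are hypothetical and invented for illustration; they are not results from the paper.

```python
from typing import List, Tuple

# (image similarity, image-text similarity); higher is better on both axes.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    front = []
    for p in points:
        # q dominates p if q is at least as good on both axes and differs from p.
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sort by image similarity so the frontier can be drawn as a curve.
    return sorted(front)


# Hypothetical scores, invented for illustration only (not from the paper).
scores = [(0.60, 0.30), (0.55, 0.33), (0.50, 0.31), (0.65, 0.25), (0.58, 0.28)]
front = pareto_front(scores)
print(front)
```

Under this definition, one method sits on a "superior" frontier when its curve lies toward the upper right of another's, meaning that for a given level of subject fidelity it reaches higher prompt fidelity, and vice versa.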