DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, Trishul Chilimbi

TL;DR

Personalizing text-to-image diffusion models with limited subject data involves a trade-off among prompt fidelity, subject fidelity, and diversity. DreamBlend performs inference-time fusion: an underfit early checkpoint generates a prompt-faithful reference image, and a later overfit checkpoint injects subject fidelity, guided by cross-attention alignment. The approach identifies catastrophic attention collapse in later checkpoints, introduces cross-attention guidance regularization to align attention maps across checkpoints, and demonstrates state-of-the-art improvements on the DreamBooth benchmark. This technique yields higher-fidelity and more diverse personalized generations with minimal retraining, enhancing practical usability while highlighting safety considerations.

Abstract

Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform a cross attention guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint, for the same prompt. This enables generation of images with better subject fidelity, prompt fidelity and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.
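The guidance idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single softmax cross-attention layer, stands in for the early and later checkpoints with two random key matrices (`early_keys`, `late_keys`), and omits the denoising loop entirely. All function and variable names are hypothetical. It only shows the core step of nudging a latent so that its attention map under the later model moves toward a reference map produced by the earlier model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(latent, token_keys):
    # Toy cross-attention: one row per spatial location, one column per text token.
    scale = np.sqrt(latent.shape[-1])
    return softmax(latent @ token_keys.T / scale, axis=-1)

def guidance_loss_and_grad(latent, token_keys, ref_map):
    # Loss: half the squared distance between current and reference attention maps.
    scale = np.sqrt(latent.shape[-1])
    attn = attention_map(latent, token_keys)
    diff = attn - ref_map
    loss = 0.5 * np.sum(diff ** 2)
    # Backpropagate through the row-wise softmax, then through the logits.
    d_logits = attn * (diff - np.sum(diff * attn, axis=-1, keepdims=True))
    grad = d_logits @ token_keys / scale
    return loss, grad

def guide_latent(latent, token_keys, ref_map, step_size=0.1, n_steps=200):
    # Plain gradient descent on the latent (assumed schedule, for illustration only).
    latent = latent.copy()
    for _ in range(n_steps):
        _, grad = guidance_loss_and_grad(latent, token_keys, ref_map)
        latent -= step_size * grad
    return latent

rng = np.random.default_rng(0)
n_locations, n_tokens, dim = 16, 4, 8

# Stand-ins for the text-token keys of the two checkpoints (assumption).
early_keys = rng.normal(size=(n_tokens, dim))
late_keys = rng.normal(size=(n_tokens, dim))

# Reference attention map from the "early checkpoint" on its own latent.
ref_map = attention_map(rng.normal(size=(n_locations, dim)), early_keys)

# Guide a fresh latent under the "later checkpoint" toward that reference map.
init_latent = rng.normal(size=(n_locations, dim))
loss_before, _ = guidance_loss_and_grad(init_latent, late_keys, ref_map)
guided_latent = guide_latent(init_latent, late_keys, ref_map)
loss_after, _ = guidance_loss_and_grad(guided_latent, late_keys, ref_map)
print(f"attention mismatch before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In an actual diffusion pipeline, an update of this kind would presumably be interleaved with the denoising steps of the later checkpoint and computed by automatic differentiation over the real cross-attention layers; the analytic gradient here is used only to keep the sketch dependency-free.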

Paper Structure

This paper contains 21 sections, 19 figures, 4 tables, 1 algorithm.

Figures (19)

  • Figure 1: Images generated by different checkpoints for the prompt 'a $backpack^*$ on a cobblestone street' as an SD model is fine-tuned from 5 to 1000 steps. Early checkpoints have higher prompt fidelity and diversity but lower subject fidelity, while later checkpoints have higher subject fidelity but lower prompt fidelity and diversity. At step=1000, the model reproduces the input images used for fine-tuning.
  • Figure 2: Attention Guidance and Attention Collapse: Images generated at 5, 250, and 1000 steps of DreamBooth fine-tuning, with text-image cross-attention maps for $backpack^*$ in Figure 1. The prompt "a sks backpack on a cobblestone street" features the rare token "sks" as $backpack^*$. Step 250 + CAG: Cross-attention guidance (CAG) from the step 5 image is effective. The resulting image maintains the layout of the step 5 image while preserving subject fidelity. Step 1000 + CAG: By step 1000, overfitting leads to catastrophic attention collapse, focusing the attention of all tokens mostly on the subject. CAG becomes ineffective as the model maps all latents to one of the input images.
  • Figure 3: Cross-Attention-Guided Image Synthesis: Across various subjects and prompts, our approach successfully preserves the layout of the reference underfit image as well as the identity of the input subject. Images generated by the Overfit (Edit) and Underfit (Guidance) models are shown for reference.
  • Figure 5: Comparison to non-fine-tuning methods TI (Textual Inversion), BLIP-D (BLIP-Diffusion), IP-A (IP-Adapter) and AnyDoor.
  • Figure 6: Image alignment (DINO) - text alignment (CLIP-T) space spanned by densely sampled operating points of DreamBooth (gray), Custom Diffusion (red) and our method (green) for two example subjects. Our method advances the Pareto front and enables generation of images closer to the top-right corner [1,1] of the image-text alignment space, inaccessible to existing methods.
  • ...and 14 more figures
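
The Pareto front mentioned in the Figure 6 caption can be made concrete with a small sketch. It assumes each operating point is an (image alignment, text alignment) pair where higher is better on both axes. The function name and the sample scores are hypothetical and purely illustrative; they are not values from the paper.

```python
def pareto_front(points):
    # Keep every point that no other point beats or ties on both axes
    # while strictly beating it on at least one.
    front = []
    for i, (x, y) in enumerate(points):
        dominated = any(
            ox >= x and oy >= y and (ox > x or oy > y)
            for j, (ox, oy) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((x, y))
    return sorted(front)

# Made-up (image alignment, text alignment) scores for illustration only.
operating_points = [
    (0.2, 0.9), (0.5, 0.8), (0.4, 0.7),
    (0.8, 0.5), (0.7, 0.4), (0.9, 0.2),
]
front = pareto_front(operating_points)
print(front)
```

A method "advances the Pareto front" in this sense when some of its operating points are not dominated by any point of the competing methods, that is, they lie closer to the [1,1] corner on at least one axis without losing on the other.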