Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang

TL;DR

Pick-and-Draw addresses overfitting in text-to-image personalization with a training-free semantic guidance framework built from two components. Appearance picking guidance transfers appearance cues from a single reference image via a saliency-guided feature palette and the Unidirectional Relaxed Earth Mover's Distance (UREMD), while layout drawing guidance injects priors from the vanilla diffusion model through cross-attention layout alignment. The method improves identity consistency and context diversity across baselines (e.g., Textual Inversion, DreamBooth, BLIP-Diffusion) and also yields favorable results when applied directly to vanilla Stable Diffusion. Quantitative gains on the DreamBench dataset (DINO, CLIP-I, CLIP-T) and ablation studies support the effectiveness of the two losses and the activation-selection strategy. Overall, Pick-and-Draw pushes the trade-off frontier between subject fidelity and image-text alignment, offering a training-free path to one-shot personalization.

Abstract

Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects in various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms generative diversity, especially when few subject images are given. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach that boosts identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. For the former, we construct an appearance palette from visual features of the reference image, from which we pick local patterns to generate the specified subject with a consistent identity. For the latter, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit its strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.
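
The idea of "picking" local patterns from an appearance palette can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes the unidirectional relaxed Earth Mover's Distance is the one-sided form of the relaxed EMD (each generated feature is matched to its nearest palette feature and the costs are averaged), and it assumes cosine distance as the ground cost. The function names and the toy feature arrays are hypothetical.

```python
import numpy as np


def cosine_cost(gen, ref, eps=1e-8):
    """Pairwise cosine distance between rows of gen (n, d) and ref (m, d)."""
    gen_n = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + eps)
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    return 1.0 - gen_n @ ref_n.T


def unidirectional_relaxed_emd(gen, ref):
    """Assumed one-sided relaxed EMD: every generated feature picks its
    cheapest match in the reference palette; the picked costs are averaged."""
    cost = cosine_cost(gen, ref)             # shape (n, m)
    return float(cost.min(axis=1).mean())    # min over palette, mean over generated


# Toy data (hypothetical): a 3-entry palette and 2 generated features.
palette = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
generated = np.array([[2.0, 0.0], [0.0, 3.0]])

forward = unidirectional_relaxed_emd(generated, palette)
backward = unidirectional_relaxed_emd(palette, generated)
print(forward, backward)
```

Under these assumptions the distance is asymmetric: `forward` is near zero because every generated feature has a match in the palette, while `backward` is larger because one palette entry is unused. A one-sided cost of this kind would not penalize the generated subject for leaving some reference patterns unpicked.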

Paper Structure (21 sections, 10 equations, 12 figures, 2 tables)


Figures (12)

  • Figure 1: Given a single reference image, Pick-and-Draw consistently improves identity consistency and image-text alignment over various personalization methods, including Textual Inversion, DreamBooth, and BLIP-Diffusion. The text prompt is "A photo of a dog in water". Additionally, directly applying Pick-and-Draw to vanilla Stable Diffusion also produces acceptable outcomes.
  • Figure 2: Overall pipeline of our proposed Pick-and-Draw. We iteratively refine the generative outcomes via appearance picking and layout drawing, which is achieved by optimizing a designed score function. In appearance picking, we pick saliency-aware features from certain cross-attention decoder layers and transfer the appearance cues by minimizing the Unidirectional Relaxed Earth Mover's Distance (UREMD), aiming to boost identity consistency. For layout drawing, we extract cross-attention maps from every cross-attention layer, smooth them with a Gaussian kernel, then minimize the Frobenius norm to draw the subject outline. This localizes the appearance transfer within the subject-related regions and introduces novel layouts from the vanilla Stable Diffusion, so as to improve generative diversity.
  • Figure 3: Illustration of cross attention maps extracted from different layers in the encoder and decoder of the UNet, numbered by inference order. Resolution is marked on the left.
  • Figure 4: Qualitative results on different baselines with and without Pick-and-Draw. The text prompt format differs slightly across the three baselines; we use the DreamBooth format for presentation.
  • Figure 5: Alignment metrics of BLIP-Diffusion before (red) and after (blue) applying Pick-and-Draw for sample subjects.
  • ...and 7 more figures
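
The layout drawing step described in the Figure 2 caption (smooth the cross-attention maps with a Gaussian kernel, then minimize a Frobenius norm) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes the norm is taken over the difference between the smoothed maps of the guided generation and those of the vanilla-model template, summed over layers. The kernel width, the edge padding, and all function names are hypothetical choices.

```python
import numpy as np


def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def smooth(attn, sigma=1.0, radius=2):
    """Separable Gaussian blur of a 2D attention map with edge padding."""
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(attn, radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def layout_loss(gen_maps, template_maps, sigma=1.0):
    """Assumed layout score: Frobenius norm of the difference between
    smoothed maps, summed over cross-attention layers."""
    return float(sum(
        np.linalg.norm(smooth(g, sigma) - smooth(t, sigma), ord="fro")
        for g, t in zip(gen_maps, template_maps)
    ))


# Toy 8x8 maps (hypothetical): the subject blob sits in different places.
template = np.zeros((8, 8))
template[2:5, 2:5] = 1.0
generated = np.zeros((8, 8))
generated[4:7, 4:7] = 1.0

print(layout_loss([template], [template]))   # identical layouts
print(layout_loss([generated], [template]))  # shifted subject
```

In this sketch the loss is zero when the guided layout already matches the template and grows as the subject region drifts away from it. Passing lists of maps lets layers with different resolutions contribute, as long as each pair of maps shares a shape.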