Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization
Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang
TL;DR
Pick-and-Draw addresses overfitting in text-to-image personalization with a training-free semantic guidance framework built from two components. Appearance picking guidance transfers appearance cues from a single reference image via a saliency-guided feature palette and the Unidirectional Relaxed Earth Mover Distance, while layout drawing guidance injects external priors from vanilla diffusion through cross-attention layout alignment. The method improves identity consistency and context diversity when added to existing baselines (e.g., Textual Inversion, DreamBooth, BLIP-Diffusion) and also yields favorable results when applied directly to vanilla Stable Diffusion. Quantitative gains on the DreamBench dataset (DINO, CLIP-I, CLIP-T) and ablation studies validate the two losses and the activation-selection strategy. Overall, Pick-and-Draw pushes the trade-off frontier between subject fidelity and image-text alignment, offering a training-free route to one-shot personalization.
Abstract
Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects in various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms generative diversity, especially when few subject images are given. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach that boosts identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. For the former, we construct an appearance palette from visual features of the reference image and pick local patterns from it to generate the specified subject with a consistent identity. For the latter, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit its strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.
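To make the idea of picking local patterns from an appearance palette concrete, the following is a minimal sketch of a one-directional relaxed Earth Mover Distance between generated features and palette features. It is an illustration under stated assumptions, not the authors' implementation: the text above does not define the distance, so the sketch assumes the common relaxed form (each generated feature is matched to its cheapest palette entry and the costs are averaged) with cosine distance as the ground cost. The function names `cosine_cost` and `one_way_relaxed_emd` are hypothetical, and saliency-based palette selection is omitted.

```python
import numpy as np


def cosine_cost(gen, palette):
    """Pairwise cosine distance between rows of gen (N, D) and palette (M, D)."""
    gen_n = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + 1e-8)
    pal_n = palette / (np.linalg.norm(palette, axis=1, keepdims=True) + 1e-8)
    return 1.0 - gen_n @ pal_n.T  # shape (N, M)


def one_way_relaxed_emd(gen, palette):
    """Assumed one-directional relaxed EMD.

    Each generated feature "picks" its nearest palette feature; palette
    entries that nothing picks contribute no cost, which is what makes the
    distance one-directional in this sketch.
    """
    cost = cosine_cost(gen, palette)
    return float(cost.min(axis=1).mean())


# Toy features: every generated feature has an exact match in the palette,
# and the palette contains one extra, unused entry.
gen_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
palette_feats = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
loss = one_way_relaxed_emd(gen_feats, palette_feats)
```

In a guidance setting, a loss of this kind would be differentiated with respect to the latent at each denoising step; that step requires an autodiff framework and is left out of this sketch.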
