BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik

TL;DR

BootPIG addresses the challenge of personalized image generation without test-time finetuning by introducing a bootstrapped training framework for pretrained diffusion models. It uses a dual-UNet setup with Reference Self-Attention to inject reference appearances and a synthetic data pipeline to bootstrap learning, achieving zero-shot personalization in about an hour of training. On DreamBooth, BootPIG outperforms zero-shot baselines and rivals test-time finetuned methods, with user studies confirming higher fidelity to both subject appearance and prompts. The approach reduces computational burden while delivering strong subject fidelity, highlighting a practical path to flexible, controllable image generation without real data or additional encoders.

Abstract

Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.

Paper Structure

This paper contains 28 sections, 6 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Existing text-to-image models demonstrate exceptional image synthesis capabilities. However, they fail to "personalize" generations according to a specific subject. BootPIG (Ours) enables zero-shot subject-driven generation through a bootstrapped training process that uses images synthesized by the text-to-image model (Bottom). BootPIG-trained text-to-image models can synthesize novel scenes containing the input subject without test-time finetuning while maintaining high fidelity to the prompt and subject.
  • Figure 2: Model Architecture: We propose a novel architecture, which we refer to as BootPIG, for personalized image generation. The model comprises two replicas of a latent diffusion model: the Reference UNet and the Base UNet. The Reference UNet processes reference images to collect the features before each Self-Attention (SA) layer. The Base UNet's SA layers are modified to Reference Self-Attention (RSA) layers that allow conditioning on extra features. Using the collected reference features as input, the Base UNet equipped with the RSA layers estimates the noise in the input to guide the image generation towards the reference objects.
  • Figure 3: Synthetic Training Data: We propose an automated data generation pipeline to generate (reference image, target image, target caption) triplets for training BootPIG. The pipeline uses ChatGPT to generate captions, Stable Diffusion to generate images and the Segment Anything Model to segment the foreground which serves as the reference image.
  • Figure 4: Qualitative Comparison: We provide visual comparisons of subject-driven generations from related methods such as BLIP-Diffusion, ELITE, and DreamBooth. BootPIG exhibits high subject and prompt fidelity, outperforming related methods while avoiding test-time finetuning.
  • Figure 5: User Study: We report the win rate (% of users who favored BootPIG generations) against existing methods. We perform two studies per head-to-head comparison, one evaluating prompt fidelity and the other evaluating subject fidelity.
  • ...and 7 more figures
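
The Figure 2 caption describes Reference Self-Attention (RSA) layers that let the Base UNet condition on features collected by the Reference UNet. The following is a minimal NumPy sketch of one plausible form of such a layer, not the authors' implementation. It assumes that queries come from the base tokens only, while keys and values are computed over the concatenation of base and reference tokens; the caption does not spell out this detail. The function name, the single-head layout, and the random projection matrices are all hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_self_attention(base_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head attention over base tokens plus extra reference tokens.

    base_tokens: (n_base, d) features inside the Base UNet.
    ref_tokens:  (n_ref, d) features collected from the Reference UNet.
    Assumption (not stated in the caption): keys and values see the
    concatenation of both token sets, queries see only the base tokens.
    """
    context = np.concatenate([base_tokens, ref_tokens], axis=0)
    q = base_tokens @ w_q  # (n_base, d)
    k = context @ w_k      # (n_base + n_ref, d)
    v = context @ w_v      # (n_base + n_ref, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Toy inputs: 4 base tokens, 6 reference tokens, feature width 8.
rng = np.random.default_rng(0)
d = 8
base = rng.normal(size=(4, d))
ref = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = reference_self_attention(base, ref, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Under this assumption the output keeps the shape of the base tokens, so the layer can stand in for an ordinary self-attention layer. With zero reference tokens it reduces to plain self-attention, which is consistent with the caption's description of RSA as a modification of the existing SA layers.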