Generate Anything Anywhere in Any Scene

Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee

TL;DR

This work addresses the problem of generating personalized objects with precise spatial control in text-to-image diffusion models. It proposes PACGen, which disentangles object identity from location/size via aggressive data augmentation during personalization and enables localization through GLIGEN-style adapters at inference, complemented by regionally-guided sampling to maintain fidelity. The approach demonstrates competitive or superior fidelity to existing personalized methods while offering explicit placement control, validated on multiple datasets and configurations. It also discusses practical considerations, including inference-time overhead and potential misuse, highlighting the need for responsible deployment in creative domains.

Abstract

Text-to-image diffusion models have attracted considerable interest due to their wide applicability across diverse fields. However, challenges persist in creating controllable models for personalized object generation. In this paper, we first identify the entanglement issues in existing personalized generative models, and then propose a straightforward and efficient data augmentation training strategy that guides the diffusion model to focus solely on object identity. By inserting the plug-and-play adapter layers from a pre-trained controllable diffusion model, our model obtains the ability to control the location and size of each generated personalized object. During inference, we propose a regionally-guided sampling technique to maintain the quality and fidelity of the generated images. Our method achieves comparable or superior fidelity for personalized objects, yielding a robust, versatile, and controllable text-to-image diffusion model that is capable of generating realistic and personalized images. Our approach demonstrates significant potential for various applications, such as those in art, entertainment, and advertising design.
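
The data augmentation strategy mentioned above (aggressive random resizing and repositioning of the training images) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scale range, the plain-colour canvas fill, the nearest-neighbour resize, and the normalized box format are all assumptions chosen to keep the example self-contained.

```python
import numpy as np


def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of an H x W x C array (keeps the sketch dependency-free)."""
    h, w = image.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return image[rows][:, cols]


def random_resize_and_reposition(image, rng, canvas_size=512,
                                 min_scale=0.25, max_scale=1.0, fill_value=0):
    """Paste a randomly rescaled copy of `image` at a random spot on a blank canvas.

    Returns the augmented canvas and the object's box as normalized
    (x0, y0, x1, y1). The scale range and fill value are hypothetical
    defaults, not values taken from the paper.
    """
    h, w = image.shape[:2]
    # Pick how much of the canvas the longer image side should cover.
    scale = rng.uniform(min_scale, max_scale)
    factor = scale * canvas_size / max(h, w)
    new_h = max(1, int(round(h * factor)))
    new_w = max(1, int(round(w * factor)))
    resized = resize_nearest(image, new_h, new_w)

    # Pick a random top-left corner that keeps the object fully inside the canvas.
    top = int(rng.integers(0, canvas_size - new_h + 1))
    left = int(rng.integers(0, canvas_size - new_w + 1))

    canvas = np.full((canvas_size, canvas_size, image.shape[2]),
                     fill_value, dtype=image.dtype)
    canvas[top:top + new_h, left:left + new_w] = resized

    box = (left / canvas_size, top / canvas_size,
           (left + new_w) / canvas_size, (top + new_h) / canvas_size)
    return canvas, box


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    photo = np.full((40, 80, 3), 255, dtype=np.uint8)  # stand-in for a user photo
    augmented, box = random_resize_and_reposition(photo, rng, canvas_size=64)
    print(augmented.shape, box)
```

Under these assumptions, each training step would see the same object at a different size and position, paired with the box that describes where it landed, so location and size stop being predictable from identity. The sketch does nothing about the collaging, multi-object, and dullness artifacts that Figure 5 attributes to augmentation.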

Paper Structure

This paper contains 15 sections, 6 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: Given just a handful of user images (left), our model, PACGen, can generate the personalized concept with both high fidelity and localization controllability in novel contexts (right).
  • Figure 2: A naive combination of DreamBooth and GLIGEN. (Left) The model generates accurate identities when the bounding box size and location roughly match those of the training distribution. (Right) However, it fails when the size and location fall outside the training distribution.
  • Figure 3: By incorporating a data augmentation technique that involves aggressive random resizing and repositioning of training images, PACGen effectively disentangles object identity and spatial information in personalized image generation.
  • Figure 4: DreamBooth incorrectly learns to entangle object identity with spatial information during training. It generates the correct identity only when the location matches the training distribution.
  • Figure 5: Data augmentation sometimes introduces collaging, multi-object, and dullness artifacts.
  • ...and 6 more figures