OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Leheng Li, Weichao Qiu, Xu Yan, Jing He, Kaiqiang Zhou, Yingjie Cai, Qing Lian, Bingbing Liu, Ying-Cong Chen

TL;DR

The core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly and enables fine-grained control with personalized identity.

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability. In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method empowers users with more flexibility in controllable generation, as users can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets. Project page: https://len-li.github.io/omnibooth-web/

Paper Structure

This paper contains 31 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overview of OmniBooth. We represent our conditions as a high-dimensional latent feature that seamlessly incorporates mask guidance and multi-modal instruction. We denote this representation as the latent control signal $\mathbf{lc}$. By painting the text embedding or warping the image embedding into $\mathbf{lc}$, we enable various modalities of control for image generation. In our framework, users can edit the input panoptic mask and instance instructions as needed to control the generated image.
  • Figure 2: Users can freely select either text or an image as the condition. Spatial warping: To provide spatial-level identity features, we warp the 2D DINO spatial feature into our latent control signal, using ROI align to map the pixel-aligned features into the latent control signal. We then randomly drop $10\%$ of the spatial embedding $\mathbf{s}_i$ and replace it with the DINO global embedding $\mathbf{g}_i$ to encode global identity.
  • Figure 3: Visualizations of text-instructed image generation. We compare our method with InstanceDiffusion (Wang et al., 2024). Our method exhibits a distinct advantage in handling dense and occluded scenarios, yielding images with pronounced depth relationships and hierarchical structures.
  • Figure 4: Image-instructed generation. Given a reference image and a target location described by an instance mask, our method aims to generate an instance with the same identity at the target location.
  • Figure 5: Zero-shot image-instructed generation. We condition on image references from the DreamBooth dataset and use different global prompts and target masks to generate images. The input instances are masked out for conditioning.
  • ...and 7 more figures
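
The captions for Figures 1 and 2 describe two operations on the latent control signal $\mathbf{lc}$: painting a text embedding into the region covered by an instance mask, and replacing part of the spatial embedding $\mathbf{s}_i$ with the global embedding $\mathbf{g}_i$. The sketch below illustrates both on plain Python lists. It is not the authors' implementation: the function names, the zero-initialised background, the rule that later masks overwrite earlier ones, and the per-location reading of the $10\%$ replacement are all assumptions made for illustration, and the ROI-align warping step is left out.

```python
import random


def paint_text_embeddings(masks, text_embeddings, height, width):
    """Write each instance's text embedding into the cells its mask covers.

    masks: list of height x width grids of 0/1 values, one per instance.
    text_embeddings: list of equal-length float vectors, one per instance.
    Returns a height x width x dim nested list (hypothetical layout for lc).
    """
    dim = len(text_embeddings[0])
    # Assumption: cells not covered by any mask stay at zero.
    lc = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for mask, emb in zip(masks, text_embeddings):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Assumption: a later instance overwrites an earlier one.
                    lc[y][x] = list(emb)
    return lc


def mix_in_global_embedding(spatial, global_emb, drop_prob=0.1, rng=None):
    """Swap each spatial vector for the global vector with probability drop_prob.

    Assumption: the 10% replacement is applied independently per location.
    """
    rng = rng or random.Random()
    return [
        list(global_emb) if rng.random() < drop_prob else list(vec)
        for vec in spatial
    ]


# Two instances on a 2 x 2 grid, each with a 2-dimensional embedding.
masks = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
embeddings = [[1.0, 2.0], [3.0, 4.0]]
lc = paint_text_embeddings(masks, embeddings, height=2, width=2)
```

In a tensor framework the same painting step would usually be a masked assignment over a `(C, H, W)` array. The nested loops here only keep the example free of dependencies.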