Object-level Visual Prompts for Compositional Image Generation

Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman

TL;DR

VisualComposer addresses the challenge of composing multi-object scenes conditioned on object-level visual prompts in diffusion models. It introduces KV-Mixed Cross-Attention, which uses a coarse encoder for keys to guide layout and a fine-grained encoder for values to preserve appearance, enabling strong identity retention without sacrificing layout diversity. At inference, the method adds Compositional Guidance, which aligns prompts to detected segments via segmentation, DINOv2 similarity, and Hungarian matching, and refines the appearance tokens with an identity-focused loss. Empirical results show superior adherence to input prompts and greater layout diversity compared with prior image-prompt and multimodal approaches, on real and synthetic multi-object datasets. Overall, VisualComposer offers a practical, controllable framework for object-level visual prompt composition in text-to-image diffusion models, with implications for fine-grained scene synthesis and potential downstream detection and attribution challenges.
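The prompt-to-segment alignment step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the feature vectors below are made-up stand-ins for DINOv2 embeddings, the function names are hypothetical, and the assignment is found by brute-force search over permutations rather than by the Hungarian algorithm. For a handful of objects both return an assignment with the same maximal total similarity; a dedicated solver such as SciPy's `linear_sum_assignment` would be the natural choice at larger scale.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_prompts_to_segments(prompt_feats, segment_feats):
    """Assign each prompt to a distinct segment, maximizing total similarity.

    Assumes there are at least as many segments as prompts.
    Returns (assignment, score), where assignment[i] is the segment index
    matched to prompt i.
    """
    sim = [[cosine(p, s) for s in segment_feats] for p in prompt_feats]
    best_score, best = -math.inf, None
    # Brute force: try every one-to-one mapping of prompts onto segments.
    for perm in permutations(range(len(segment_feats)), len(prompt_feats)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best = score, perm
    return best, best_score

# Toy features (hypothetical): two visual prompts, three detected segments.
prompt_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
segment_feats = [[0.1, 0.9, 0.0], [0.0, 0.1, 1.0], [0.9, 0.1, 0.0]]

assignment, score = match_prompts_to_segments(prompt_feats, segment_feats)
print(assignment)  # (2, 0): prompt 0 -> segment 2, prompt 1 -> segment 0
```

The one-to-one constraint is what keeps two prompts from claiming the same generated object; the unmatched segment (index 1 here) is simply left unassigned.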

Abstract

We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
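To make the key/value split concrete, here is a minimal NumPy sketch of a single KV-mixed cross-attention step. It is a sketch under stated assumptions, not the paper's architecture: the dimensions, the random projection matrices `w_k` and `w_v`, and the one-to-one pairing of layout and appearance tokens are all illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kv_mixed_cross_attention(queries, layout_tokens, appearance_tokens, w_k, w_v):
    """Cross-attention with keys and values drawn from different encoders.

    queries:           (n_pixels, d)        image features from the denoiser
    layout_tokens:     (n_tokens, d_layout) coarse (small-bottleneck) tokens
    appearance_tokens: (n_tokens, d_app)    fine-grained tokens, paired 1:1
                                            with the layout tokens (assumption)
    w_k: (d_layout, d), w_v: (d_app, d)     projection matrices
    """
    keys = layout_tokens @ w_k        # keys come only from the layout branch
    values = appearance_tokens @ w_v  # values come only from the appearance branch
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)   # where each token acts: set by layout tokens
    return attn @ values, attn        # what is rendered there: set by appearance tokens

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_pixels, n_tokens, d, d_layout, d_app = 16, 4, 8, 6, 12
queries = rng.normal(size=(n_pixels, d))
layout_tokens = rng.normal(size=(n_tokens, d_layout))
appearance_tokens = rng.normal(size=(n_tokens, d_app))
w_k = rng.normal(size=(d_layout, d))
w_v = rng.normal(size=(d_app, d))

out, attn = kv_mixed_cross_attention(queries, layout_tokens, appearance_tokens, w_k, w_v)
print(out.shape, attn.shape)  # (16, 8) (16, 4)
```

Because the attention weights depend only on the queries and the layout-derived keys, changing the appearance tokens alters what is rendered but not where, which is the separation the abstract describes.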

Paper Structure

This paper contains 43 sections, 3 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: We introduce a method for composing object-level visual prompts (shown above each column), where prompts consist of both foreground and background elements that jointly guide the generation in text-to-image models. Similar to text prompts, these visual prompts enable creating semantically coherent compositions across a variety of styles and scenes without the need for a predefined layout.
  • Figure 2: KV-Mixing. Image Prompt Adapters capture visual information from images to guide the generation process. The feature extractor's bottleneck size (top row) determines the level of detail in the extracted Key-Value (KV) features. Using only coarse KVs (left) sacrifices identity preservation, while using only fine-grained KVs (middle) limits scene variation. In contrast, combining mixed-granularity KVs (right) achieves diverse scene representation without compromising identity preservation.
  • Figure 3: VisualComposer architecture. Our method begins by encoding all input visual prompts through two separate branches: an appearance branch (top row, shown in orange) that uses a Fine-Grained encoder followed by an Appearance adapter to encode per-prompt appearance tokens, and a layout branch (bottom row, shown in blue) that uses a Coarse encoder followed by a Layout adapter to encode per-prompt layout tokens. Once the appearance and layout tokens are extracted from the input visual prompts, they are injected into the U-Net through Object-Centric KV-Mixed Cross Attention layers. The layout tokens are input as keys and determine the spatial influence of each individual visual prompt in the final image, as visualized by the per-object attention masks. The appearance tokens are input as values after the attention mask is computed and hence influence only appearance and identity.
  • Figure 4: Gallery. Compositional images generated by VisualComposer. Four outputs (right) for each set of input visual prompts (left).
  • Figure 5: Comparisons to prior methods. We show a set of input visual prompts on the left. For each set, we show results generated by different methods. Our method achieves the best balance between identity preservation of the input prompts and image diversity. Our method is the only one that successfully generates the two objects in realistic layouts without fusing them or outputting duplicates.
  • ...and 9 more figures