Object-level Visual Prompts for Compositional Image Generation
Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
TL;DR
VisualComposer addresses the challenge of composing multi-object scenes conditioned on object-level visual prompts in diffusion models. It introduces KV-Mixed Cross-Attention, which uses a coarse encoder for keys to guide layout and a fine-grained encoder for values to preserve appearance, enabling strong identity retention without sacrificing layout diversity. At inference, the method adds Compositional Guidance, which aligns prompts to detected segments via segmentation, DINOv2 similarity, and Hungarian matching, while refining appearance tokens with an identity-focused loss. Empirical results show superior adherence to input prompts and greater layout diversity compared with prior image-prompt and multimodal approaches, on real and synthetic multi-object datasets. Overall, VisualComposer offers a practical, controllable framework for object-level visual prompt composition in text-to-image diffusion models, with implications for fine-grained scene synthesis and potential downstream detection and attribution challenges.
Abstract
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
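The KV-mixed cross-attention described above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function name, the projection matrices `w_q`, `w_k`, `w_v`, all dimensions, and the random inputs are hypothetical. It also assumes that the coarse and fine encoders emit the same number of tokens per visual prompt, so that each key pairs with exactly one value.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kv_mixed_cross_attention(queries, coarse_tokens, fine_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with keys and values from different encoders.

    queries:       (n_q, d_model)  image features inside the diffusion model
    coarse_tokens: (n_t, d_coarse) small-bottleneck encoder output (layout)
    fine_tokens:   (n_t, d_fine)   large-bottleneck encoder output (appearance)
    Assumption: token i of both encoders describes the same part of a prompt.
    """
    q = queries @ w_q        # (n_q, d_attn)
    k = coarse_tokens @ w_k  # (n_t, d_attn), keys decide *where* to attend
    v = fine_tokens @ w_v    # (n_t, d_out), values decide *what* is injected
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # (n_q, n_t), each row sums to 1
    return weights @ v, weights


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n_q, n_t = 16, 4
d_model, d_coarse, d_fine, d_attn, d_out = 32, 8, 64, 16, 32

queries = rng.normal(size=(n_q, d_model))
coarse_tokens = rng.normal(size=(n_t, d_coarse))
fine_tokens = rng.normal(size=(n_t, d_fine))
w_q = rng.normal(size=(d_model, d_attn))
w_k = rng.normal(size=(d_coarse, d_attn))
w_v = rng.normal(size=(d_fine, d_out))

out, weights = kv_mixed_cross_attention(
    queries, coarse_tokens, fine_tokens, w_q, w_k, w_v
)

# Changing only the fine (appearance) tokens leaves the attention map untouched.
out_alt, weights_alt = kv_mixed_cross_attention(
    queries, coarse_tokens, 2.0 * fine_tokens, w_q, w_k, w_v
)
```

In this sketch the attention map depends only on the queries and the coarse tokens, while the fine tokens affect only the content that is mixed in. This is one way to read the abstract's claim that layout and appearance are controlled by complementary sources.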
