Multitwine: Multi-Object Compositing with Text and Layout Control

Gemma Canet Tarrés, Zhe Lin, Zhifei Zhang, He Zhang, Andrew Gilbert, John Collomosse, Soo Ye Kim

TL;DR

Multitwine is a diffusion-based framework for simultaneous multi-object compositing guided by both a text prompt and an explicit layout. The text prompt and the object images are combined into a multimodal embedding that a Stable Diffusion backbone consumes via cross-attention, while the masked background and the layout are concatenated to the model input; the cross-attention mechanisms preserve object identity while enforcing scene-level coherence. The model is trained jointly for compositing and subject-driven customization with three losses, $L_d$ (denoising), $L_c$ (identity disentanglement), and $L_s$ (inter-object leakage suppression), combined as $L = L_d + \alpha L_c + \beta L_s$. Training data comes from data-generation pipelines that use LLMs and VLMs to produce aligned multimodal examples. The approach achieves state-of-the-art performance in both simultaneous multi-object compositing and subject-driven generation, and enables applications such as subject-driven inpainting and complex interactive scenes. Its main stated limitation is scalability to very large object counts; stronger diffusion backbones are suggested as a direction for future improvement.
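As a minimal sketch of how the three loss terms are combined, the weighted sum $L = L_d + \alpha L_c + \beta L_s$ can be written as a small function. The function name `total_loss` and the numeric values below are illustrative assumptions; this summary does not state the values of $\alpha$ and $\beta$ or how each term is computed.

```python
def total_loss(l_d: float, l_c: float, l_s: float,
               alpha: float, beta: float) -> float:
    """Combine the three training losses as L = L_d + alpha * L_c + beta * L_s.

    l_d: denoising loss
    l_c: identity disentanglement loss
    l_s: inter-object leakage suppression loss
    alpha, beta: weighting hyperparameters (values are not given in this
    summary; callers must supply them).
    """
    return l_d + alpha * l_c + beta * l_s


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
example = total_loss(l_d=1.0, l_c=2.0, l_s=4.0, alpha=0.5, beta=0.25)
```

Setting `alpha` or `beta` to zero removes the corresponding auxiliary term and leaves the denoising objective unchanged.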

Abstract

We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like "taking a selfie", our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.

Paper Structure

This paper contains 24 sections, 3 equations, 23 figures, 3 tables.

Figures (23)

  • Figure 1: Training Data Generation from Video Data. Paired training data is obtained from video object relation datasets [shang2017video, shang2019annotating] by extracting three frames with corresponding annotations and leveraging Vision-Language Models [liu2023improvedllava].
  • Figure 2: Comparison of simultaneous vs. sequential object compositing. Sequential addition prevents reposing of previously composited objects, resulting in limited, less cohesive compositions.
  • Figure 2: Training Data Generation from Image Data via Top-Down Approach. Paired training data is derived from in-the-wild images by leveraging a Vision-Language Model [cai2024vip] and a Semantic Segmentator [qi2022entityseg].
  • Figure 3: Model Architecture. Our model consists of: (i) a Stable Diffusion backbone including a U-Net and an autoencoder ($\mathcal{G}$, $\mathcal{D}$); (ii) a text encoder $\mathcal{E}_{T}$; (iii) an image encoder $\mathcal{E}_{I}$; and (iv) an adaptor $\mathcal{A}$. Given a text prompt $\mathcal{C}$ and images of $N$ objects $\mathcal{O}_{0,\dots,N-1}$, the text embedding from (ii) is augmented by concatenating each image embedding after its corresponding text tokens. The resulting multimodal embedding $\mathcal{H}$ is fed to the U-Net via cross-attention. The masked background image $(1-\mathcal{M}_{G})*\mathcal{I}_{BG}$ and the layout $\mathcal{I}_{L}$ with object-specific bboxes are concatenated to the input $\mathcal{I}$.
  • Figure 3: Training Data Generation from Image Data via Bottom-Up Approach. Paired training data is extracted from in-the-wild images with a paired caption by leveraging a Grounding Model [liu2023groundingdino] and a Semantic Segmentator [qi2022entityseg].
  • ...and 18 more figures
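
The Model Architecture caption describes building the multimodal embedding $\mathcal{H}$ by placing each object's image embedding after its corresponding text tokens. A minimal sketch of that interleaving step follows. It assumes the image embeddings have already been mapped to the text-embedding dimension and that each object is keyed by the index of the last text token that mentions it; the function name and both conventions are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def build_multimodal_embedding(
    text_tokens: List[Vector],
    object_tokens: Dict[int, List[Vector]],
) -> List[Vector]:
    """Interleave object image embeddings into a text embedding sequence.

    text_tokens: one vector per text token, in prompt order.
    object_tokens: maps a text-token index to the image-embedding vectors
        to insert right after that token (hypothetical convention).
    Returns the merged sequence, which would serve as cross-attention context.
    """
    merged: List[Vector] = []
    for index, token in enumerate(text_tokens):
        merged.append(token)
        # Insert this object's image embedding right after its text token.
        merged.extend(object_tokens.get(index, []))
    return merged


# Toy example with 1-dimensional "embeddings" so the ordering is easy to see.
text = [[0.0], [1.0], [2.0]]
objects = {1: [[10.0], [11.0]]}
merged_sequence = build_multimodal_embedding(text, objects)
```

In this toy example the two object vectors land between the second and third text tokens, so the merged sequence is longer than the original prompt by the number of inserted image tokens.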