Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami

TL;DR

Insert In Style tackles cross-domain object insertion by delivering a zero-shot generative framework that harmonizes a subject's identity with stylized backgrounds. It introduces a three-stage disentangled training protocol and a specialized masked-attention mechanism atop a diffusion-based latent backbone, enabling independent learning of identity ($Z_{ref}$) and style ($Z_{style}$) representations and their controlled composition using $Z_t$ and $Z_c$. A 100k-sample dataset (with a rigorous two-stage human-calibrated filter) and a new public benchmark, Insert In Style Bench, underpin the approach, achieving state-of-the-art results on identity preservation, style coherence, and aesthetics, corroborated by a user study. The method remains competitive on in-domain photorealistic tasks, demonstrating broad generalization and practical potential for drag-and-drop cross-domain content insertion.

Abstract

Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
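
To make the idea of a masked-attention mechanism concrete, the following is a minimal, generic sketch of block-masked attention in NumPy. It is not the paper's architecture: the three token groups (target, identity reference, style), the `allowed` table that says which group may attend to which, and the function names `group_mask` and `masked_attention` are all assumptions chosen for illustration. The abstract does not specify the actual masking pattern.

```python
import numpy as np


def group_mask(group_ids, allowed):
    """Build a boolean [N, N] mask from per-token group ids.

    Entry (q, k) is True when the group of query token q is allowed
    to attend to the group of key token k, according to `allowed`.
    """
    g = np.asarray(group_ids)
    table = np.asarray(allowed, dtype=bool)
    return table[g[:, None], g[None, :]]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out pairs get zero weight.

    Assumes every row of `mask` has at least one True entry
    (for example, each group may attend to itself).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical layout: group 0 = target tokens, 1 = identity reference, 2 = style.
ids = [0, 0, 1, 1, 2]
# Hypothetical rule: target tokens see everything; reference and style
# tokens only see their own group, so they cannot leak into each other.
allowed = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))

mask = group_mask(ids, allowed)
out, weights = masked_attention(q, k, v, mask)
```

In this sketch the mask is what keeps the identity and style token groups from mixing, which is one plausible reading of "preventing concept interference"; a different `allowed` table would encode a different disentanglement rule.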

Paper Structure

This paper contains 31 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Insert In Style: Zero-Shot Cross-Domain Composition. (Rows 1-2) Comparison with the state-of-the-art cross-domain method AIComposer aicomposer_iccv2025. AIComposer's "blend-then-refine" approach corrupts object identity by misapplying background features. Insert In Style consistently generates a high-fidelity subject that is perfectly harmonized with the scene style. (Row 3) We demonstrate Insert In Style's versatile generalization: our single, zero-shot model seamlessly inserts one subject into four distinct stylized backgrounds.
  • Figure 2: Insert In Style generalizes across in-domain and cross-domain tasks. Top (In-domain): The cross-domain specialist method AIComposer aicomposer_iccv2025 incorrectly harmonizes the object. Our method maintains high fidelity, competitive with the in-domain specialist method DreamFuse dreamfuse_iccv2025. Bottom (Cross-domain): DreamFuse dreamfuse_iccv2025 fails with a style mismatch, while AIComposer's aicomposer_iccv2025 harmonization corrupts object fidelity by incorrectly applying background style attributes. Insert In Style uniquely generates a high-fidelity, style-coherent result.
  • Figure 3: Dataset Pipeline. (a) Generation: We create a large-scale, diverse raw corpus by applying a mix of state-of-the-art stylization methods (FLUX.1-Kontext fluxkontextdev, CSGO csgo_neurips2025, and CAST cast_siggraph2022). (b) Filtering: Our raw dataset is then refined by our rigorous two-stage filtering process. The Identity Consistency filter prunes samples with semantic drift in the subject region, while the Style Coherence filter removes aesthetic mismatches between the subject region and its surrounding background, together ensuring a high-fidelity dataset.
  • Figure 4: Qualitative samples from our Insert In Style Dataset. Spanning 100k samples and 1,140 unique styles, it is the largest-scale corpus for this task. Each <Subject, Composite, Stylized Composite> triplet provides the strong, aligned supervision required to train robust, cross-domain insertion models.
  • Figure 5: Our multi-stage training protocol on a DiT backbone (a). Stages 1 (b) and 2 (c) are trained in parallel to independently learn object and style encoding. Stage 3 (d) learns composition by assembling these frozen branches, guided by our Structural Mask Attention (e).
  • ...and 3 more figures
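
The two-stage filtering described in the Figure 3 caption can be illustrated with a small sketch. Everything concrete here is an assumption: the `Sample` record, the `identity_score` and `style_score` fields, and the threshold values are hypothetical placeholders, since the source does not state how the Identity Consistency and Style Coherence filters are scored or calibrated.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    identity_score: float  # hypothetical: higher means less semantic drift in the subject
    style_score: float     # hypothetical: higher means subject and background styles agree


def two_stage_filter(samples, identity_min=0.8, style_min=0.7):
    """Keep samples that pass both filters, applied in sequence.

    Thresholds are illustrative defaults, not values from the paper.
    """
    # Stage 1: drop samples whose subject identity has drifted.
    stage1 = [s for s in samples if s.identity_score >= identity_min]
    # Stage 2: of the survivors, drop samples with a style mismatch.
    return [s for s in stage1 if s.style_score >= style_min]


raw = [
    Sample("a", identity_score=0.90, style_score=0.90),
    Sample("b", identity_score=0.50, style_score=0.95),  # fails stage 1
    Sample("c", identity_score=0.85, style_score=0.40),  # fails stage 2
]
kept = two_stage_filter(raw)
print([s.name for s in kept])
```

Running the filters in sequence means the second check only sees samples that already preserve identity, which matches the caption's description of two filters that together yield the final dataset.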