Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami
TL;DR
Insert In Style is a zero-shot generative framework for cross-domain object insertion: it places a real-world subject into a stylized background while preserving the subject's identity and matching the background's style. It combines a three-stage disentangled training protocol with a specialized masked-attention mechanism on top of a diffusion-based latent backbone, so that identity ($Z_{ref}$) and style ($Z_{style}$) representations are learned independently and then composed in a controlled way using $Z_t$ and $Z_c$. The approach is supported by a 100k-sample dataset, filtered with a rigorous two-stage human-calibrated process, and by a new public benchmark, Insert In Style Bench. The method achieves state-of-the-art results on identity preservation, style coherence, and aesthetics, corroborated by a user study, and remains competitive on in-domain photorealistic tasks, which suggests broad generalization and practical potential for drag-and-drop cross-domain content insertion.
Abstract
Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k-sample dataset curated with a novel data pipeline that couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
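To make the idea of masked attention concrete, the following is a minimal sketch of how a block mask can keep token groups from interfering with one another. It is not the architecture described in this work: the group names (`identity`, `style`, `composite`), the visibility rules in `ALLOWED`, and both helper functions are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical visibility rules (an assumption, not taken from the paper):
# each query group lists the key groups it is allowed to attend to.
ALLOWED = {
    "identity": {"identity"},                         # identity tokens see only themselves
    "style": {"style"},                               # style tokens see only themselves
    "composite": {"composite", "identity", "style"},  # composite tokens read from all groups
}

def build_mask(groups):
    """Return mask[i][j] == True when query token i may attend to key token j."""
    return [[gj in ALLOWED[gi] for gj in groups] for gi in groups]

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention in which disallowed pairs get zero weight."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Disallowed positions are set to -inf so they vanish after the softmax.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) * scale if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # finite: every token may at least attend to its own group
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out

# One token per group, with toy two-dimensional features.
groups = ["identity", "style", "composite"]
mask = build_mask(groups)
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out = masked_attention(q, k, v, mask)
```

In this toy setup the identity and style tokens are unaffected by the composite token's values, while the composite token mixes information from all three groups. A general-purpose unified-attention model would correspond to a mask that is `True` everywhere.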
