Magic Insert: Style-Aware Drag-and-Drop

Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E. Jacobs, Shlomi Fruchter

TL;DR

Magic Insert formalizes style-aware drag-and-drop and presents a diffusion-based pipeline that preserves subject identity while adopting the target image's style. It combines style-aware personalization (LoRA fine-tuning plus two learned text-token embeddings) with style injection through IP-Adapter, and introduces Bootstrapped Domain Adaptation to adapt a real-image insertion model to stylized domains. The SubjectPlop dataset provides a standardized benchmark spanning diverse styles for evaluation. Empirical results show improved style adherence and insertion realism over inpainting baselines, with flexible subject edits and scene interactions demonstrated via LLM-guided affordances. The work advances practical capabilities for creative image composition in stylized domains and offers a public dataset and evaluation suite to spur further research.

Abstract

We present Magic Insert, a method for dragging-and-dropping subjects from a user-provided image into a target image of a different style in a physically plausible manner while matching the style of the target image. This work formalizes the problem of style-aware drag-and-drop and presents a method for tackling it by addressing two sub-problems: style-aware personalization and realistic object insertion in stylized images. For style-aware personalization, our method first fine-tunes a pretrained text-to-image diffusion model using LoRA and learned text tokens on the subject image, and then infuses it with a CLIP representation of the target style. For object insertion, we use Bootstrapped Domain Adaptation to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method significantly outperforms traditional approaches such as inpainting. Finally, we present a dataset, SubjectPlop, to facilitate evaluation and future progress in this area. Project page: https://magicinsert.github.io/
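
The LoRA fine-tuning mentioned in the abstract can be illustrated with a minimal sketch of the standard low-rank update, in which a frozen weight matrix W is augmented by a trainable product B·A. This is not the authors' code: the function names, the `alpha / rank` scaling convention, and the tiny pure-Python matrices are assumptions chosen for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / rank) * B @ A.

    w: frozen pretrained weight, shape (out_features, in_features)
    b: trainable factor, shape (out_features, rank)
    a: trainable factor, shape (rank, in_features)
    alpha: scaling hyperparameter (assumed convention, not taken from the paper)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank delta with the same shape as w
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x2 layer with a rank-1 delta.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
adapted = lora_effective_weight(W, B, A, alpha=1.0)
```

Only `B` and `A` would be trained in such a setup, so the number of trainable parameters grows with the rank rather than with the full size of `W`.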

Paper Structure (23 sections, 4 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Using Magic Insert we are able to, for the first time, drag-and-drop a subject from an image with an arbitrary style onto another target image with a vastly different style and achieve a style-aware and realistic insertion of the subject into the target image.
  • Figure 2: Style-Aware Personalization: To generate a subject that fully respects the style of the target image while also conserving the subject's essence and identity, we (1) personalize a diffusion model in both weight and embedding space, by training LoRA deltas on top of the pre-trained diffusion model and simultaneously training the embeddings of two text tokens using the diffusion denoising loss, and (2) use this personalized diffusion model to generate the style-aware subject by embedding the style of the target image and conducting adapter style-injection into select upsampling layers of the model during denoising.
  • Figure 3: Subject Insertion: In order to insert the style-aware personalized generation, we (1) copy-paste a segmented version of the subject onto the target image, and (2) run our subject insertion model on the deshadowed image. This creates context cues and realistically embeds the subject into the image, including shadows and reflections.
  • Figure 4: Bootstrapped Domain Adaptation: Surprisingly, a diffusion model trained for subject insertion/removal on data captured in the real world can generalize to images in the wider stylistic domain in a limited fashion. We introduce bootstrapped domain adaptation, where a model's effective domain can be adapted by using a subset of its own outputs. (Left) Specifically, we use a subject removal/insertion model to first remove subjects and shadows from a dataset from our target domain. Then, we filter flawed outputs and use the filtered set of images to retrain the subject removal/insertion model. (Right) We observe that the initial distribution (blue) changes after training (purple), and images that were initially treated incorrectly (red samples) are subsequently treated correctly (green). When doing bootstrapped domain adaptation, we train on only the initially correct samples (green).
  • Figure 5: Results Gallery: Examples of our Magic Insert method for different subjects and backgrounds with vastly different styles.
  • ...and 5 more figures
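
The bootstrapped domain adaptation procedure described in the Figure 4 caption (remove subjects, filter flawed outputs, retrain on what survives) can be sketched as a short loop. This is a hedged outline and not the authors' implementation: `remove_subject`, `is_flawed`, and `retrain` are hypothetical callables standing in for the removal/insertion model, the filtering step, and the training routine, and the toy integer "images" exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Model = TypeVar("Model")


def bootstrapped_domain_adaptation(
    model: Model,
    target_images: Sequence[Image],
    remove_subject: Callable[[Model, Image], Image],
    is_flawed: Callable[[Image, Image], bool],
    retrain: Callable[[Model, List[Tuple[Image, Image]]], Model],
) -> Tuple[Model, List[Tuple[Image, Image]]]:
    """Adapt a model to a new domain using a filtered subset of its own outputs."""
    kept: List[Tuple[Image, Image]] = []
    for image in target_images:
        # Step 1: run the current model on a target-domain image.
        removed = remove_subject(model, image)
        # Step 2: discard outputs judged to be flawed.
        if not is_flawed(image, removed):
            kept.append((image, removed))
    # Step 3: retrain only on the initially correct (with, without) pairs.
    return retrain(model, kept), kept


# Toy stand-ins (all hypothetical): images are ints and the "model" is a counter.
toy_remove = lambda model, image: image - 1
toy_is_flawed = lambda image, removed: image % 2 == 1
toy_retrain = lambda model, pairs: model + len(pairs)

adapted_model, training_pairs = bootstrapped_domain_adaptation(
    0, [1, 2, 3, 4], toy_remove, toy_is_flawed, toy_retrain
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or filtering criterion; the caption does not specify how flawed outputs are identified, so that decision is left to `is_flawed`.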