Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model

Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa

TL;DR

PhD addresses subject-driven image editing by combining a segmentation-driven Paste step with a dedicated Inpaint and Harmonize module, while keeping the pre-trained diffusion backbone frozen to retain its strong text-driven generation. Rather than fine-tuning the diffusion model, the approach guides it through a learnable inpainting and harmonizing module that blends pasted subjects into contextually coherent scenes. Across subject-driven editing and scene generation tasks, PhD achieves state-of-the-art results on quantitative metrics and delivers high-quality, semantically consistent composites, validated by user studies. The work offers a practical, flexible framework for exemplar-based editing with textual control, suitable for diverse subjects and scenes.

Abstract

Text-to-image generative models have attracted rising attention for flexible image editing via user-specified descriptions. However, text descriptions alone are not enough to convey the details of subjects, often compromising the subjects' identity or requiring additional per-subject fine-tuning. We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD), which leverages an exemplar image in addition to text descriptions to specify user intentions. In the pasting step, an off-the-shelf segmentation model is employed to identify a user-specified subject within an exemplar image, which is subsequently inserted into a background image to serve as an initialization capturing both scene context and subject identity. To guarantee the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to seamlessly blend the inserted subject into the scene. As we keep the pre-trained diffusion model frozen, we preserve its strong image synthesis ability and text-driven ability, thus achieving high-quality results and flexible editing with diverse texts. In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance in both tasks. More qualitative results can be found at https://sites.google.com/view/phd-demo-page.

Paper Structure

This paper contains 31 sections, 4 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Qualitative comparisons with previous subject-driven image editing methods, where PbE denotes Paint-by-Example (Yang et al., 2022). The area outlined in red denotes the editing area.
  • Figure 2: Illustration of our proposed Paste, Inpaint and Harmonize via Denoising (PhD) framework. In the Paste step, we extract the subject from the exemplar image $I_q$ using a segmentation model and remove the background of the object within the mask $I_m$ to obtain $I_e$. We then paste $I_e$ onto the masked scene image to obtain the pasted image $\hat{I}_p$. In the Inpaint and Harmonize via Denoising step, the inpainting and harmonizing module $F_c$ takes $\hat{I}_p$ as input and outputs editing information $c$ to guide the frozen pre-trained diffusion model. The text encoder $F_t$ takes textual prompts as input, so the style or scene of the edited image can be adjusted through text.
  • Figure 3: Qualitative results of subject-driven image editing methods, where B-D denotes the blended-diffusion model (Avrahami et al., 2022) and PbE denotes Paint-by-Example (Yang et al., 2022). The results are generated by our method without any further optimization. More quantitative results can be found in the Appendix.
  • Figure 4: Qualitative results of subject-driven scene generation and style transfer with texts, where each title gives the name of the scene or style.
  • Figure 5: Ablation study comparing against naive Stable Diffusion approaches. I2I denotes the image-to-image pipeline with the pasted image $\hat{I}_p$ as input. Inpaint denotes inpainting $\hat{I}_p$, Inpaint* is Inpaint with a null prompt, and ldm refers to directly fine-tuning the latent diffusion model.
  • ...and 7 more figures
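
The Paste step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the exemplar and the scene are already aligned arrays of the same size, and the function name `paste_subject` and the argument names `subject_mask` and `edit_mask` are hypothetical. In the paper the subject mask comes from an off-the-shelf segmentation model; here it is simply passed in.

```python
import numpy as np

def paste_subject(exemplar, subject_mask, scene, edit_mask):
    """Compose a pasted image from a segmented subject and a masked scene.

    exemplar, scene: float arrays of shape (H, W, 3), assumed pre-aligned.
    subject_mask:    (H, W) binary mask of the subject (assumed to come
                     from a segmentation model).
    edit_mask:       (H, W) binary mask of the editing region in the scene.
    All names here are illustrative assumptions.
    """
    edit = edit_mask.astype(bool)
    # Keep only subject pixels that fall inside the editing region,
    # i.e. drop the exemplar's background.
    subject = subject_mask.astype(bool) & edit
    # Blank out the editing region of the scene (the masked scene image).
    pasted = np.where(edit[..., None], 0.0, scene)
    # Paste the background-free subject onto the masked scene.
    pasted = np.where(subject[..., None], exemplar, pasted)
    return pasted
```

The result plays the role of the pasted image $\hat{I}_p$: scene pixels outside the editing region are untouched, subject pixels are copied in, and the remaining pixels inside the editing region are left blank for the inpainting and harmonizing module to fill.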