DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou

TL;DR

This work introduces DreamInpainter, a diffusion-model-based framework for Text-Guided Subject-Driven Image Inpainting that uses both text prompts and a reference exemplar to guide inpainting. It tackles the copy-paste risk and limited text control by (1) extracting discriminative dense subject features from the UNet downstack, (2) selecting the top-K tokens to preserve identity while enabling edits, and (3) applying a decoupling regularization that imposes text-driven restoration over the entire image. Empirical results on COCOEE and DreamBoothEE show improved realism and stronger alignment with text prompts, along with ablations validating the effectiveness of token selection and decoupling regularization. The approach enables a range of applications from faithful subject insertion to stylized and attribute-edited inpainted content, contributing a practical solution for balanced, controllable inpainting with dual guidance signals.

Abstract

This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While text and exemplar images have each been used independently in previous efforts, their combined use remains unexplored. Accommodating both conditions simultaneously is challenging because editability must be balanced against subject fidelity. To tackle this challenge, we propose a two-step approach, DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showing its effectiveness for text-guided subject-driven image inpainting.
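The token selection step described above can be illustrated with a minimal sketch. It assumes the dense subject features are a list of token vectors and that each token already has a scalar importance score. The scoring rule used here (the L2 norm of each token) and the names `select_top_k_tokens` and `l2_norm_scores` are hypothetical placeholders for illustration only, not the criterion used by DreamInpainter.

```python
import math


def l2_norm_scores(tokens):
    """Hypothetical scoring rule: the L2 norm of each token vector."""
    return [math.sqrt(sum(v * v for v in token)) for token in tokens]


def select_top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank indices by descending score; ties go to the lower index.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:k]
    keep = sorted(ranked)
    return keep, [tokens[i] for i in keep]


# Four toy 2-dimensional tokens standing in for dense subject features.
tokens = [[0.1, 0.0], [3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
scores = l2_norm_scores(tokens)
kept_idx, kept_tokens = select_top_k_tokens(tokens, scores, k=2)
```

Only the selected tokens would then be passed on as the image condition. Discarding the remaining tokens is what limits how much of the reference can be reproduced verbatim, which is the intuition behind avoiding copy-paste.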

Paper Structure (17 sections, 5 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: Unlike previous works, which accept at most one condition for inpainting, we consider the task of Text-Guided Subject-Driven Image Inpainting, a generalization of previous tasks whose objective is to inpaint an image using both exemplar images and a text description.
  • Figure 2: The training diagram of our method. On the left, we show the main training pipeline for inpainting, in which noise is added to the object only. On the right, we present the decoupling regularization, in which noise is added to the whole image. We first extract features of the reference object with the downstack of the UNet in the diffusion model and then perform token selection to avoid copy-paste.
  • Figure 3: Copy-paste artifacts when using all tokens from the UNet. The model learns a trivial mapping that copies the reference object directly into the masked region.
  • Figure 4: Text-guided subject-driven image inpainting results. Note that the strong inpainting baselines rombach2022high and yang2023paint do not support combined text and image guidance.
  • Figure 5: Comparison to state-of-the-art inpainting methods. Stable Inpaint rombach2022high takes text as its condition, while Blended Diffusion avrahami2022blended and Paint-by-Example (PBE) yang2023paint take reference images as their condition. By contrast, our method takes both text and image. We use only a single short word as our text input, such as cup, dog, or bird. The image feature $c_{x_r}$ injects the detailed information.
  • ...and 13 more figures
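
The two noising regimes in the Figure 2 caption (noise on the object only versus noise on the whole image) can be sketched as follows. This is a toy illustration under stated assumptions: the image is a flat list of pixel values, the standard forward-diffusion form $x_t = \sqrt{\bar{\alpha}}\,x_0 + \sqrt{1-\bar{\alpha}}\,\epsilon$ is used, and the function name `add_noise` and the `alpha_bar` value are hypothetical, not taken from the paper.

```python
import math
import random


def add_noise(image, mask, alpha_bar, rng):
    """Noise pixels where mask is 1; leave pixels where mask is 0 untouched."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    out = []
    for pixel, m in zip(image, mask):
        noised = signal_scale * pixel + noise_scale * rng.gauss(0.0, 1.0)
        out.append(noised if m else pixel)
    return out


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [0.2, 0.5, 0.8, 0.1]

# Main inpainting pipeline: noise only inside the object region.
object_mask = [0, 1, 1, 0]
inpaint_input = add_noise(image, object_mask, alpha_bar=0.5, rng=rng)

# Decoupling regularization: noise over the whole image.
full_mask = [1] * len(image)
regularizer_input = add_noise(image, full_mask, alpha_bar=0.5, rng=rng)
```

In the first case the background pixels stay clean, so only the object region has to be denoised. In the second case every pixel is corrupted, so the denoiser cannot rely on clean context alone.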