
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, Kun Zhang

TL;DR

SmartBrush introduces a diffusion-based framework for object inpainting guided by both text and shape cues. It enhances control through mask precision levels and foreground mask prediction to preserve background, and leverages multi-task training with text-to-image data to improve realism and alignment. The approach achieves state-of-the-art results on OpenImages and MSCOCO datasets, outperforming baselines in visual quality, mask fidelity, and background preservation. The work thoughtfully addresses text and mask alignment challenges, offering practical benefits for fine-grained, controllable inpainting tasks.

Abstract

Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To preserve the background better, we propose a novel training and sampling strategy by augmenting the diffusion U-Net with object-mask prediction. Lastly, we introduce a multi-task training strategy by jointly training inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.

Paper Structure (14 sections, 10 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Our method generates high-quality object inpainting results. Different mask precision levels allow users to either provide exact masks (top row) or use a rough mask outline (bottom row). Compared to existing methods, our method generates more realistic images, follows accurate masks more closely (top row), and shows better background preservation for coarse masks (bottom row).
  • Figure 2: Text and shape guided object inpainting. Given an image $x_0$, accurate mask $m$ and object description $d$, we transform the mask $m$ to different precision levels (from accurate to coarse) as $m_s$. We add noise in the masked region to provide rich background information to the diffusion model and train the model to predict the added noise as well as the accurate mask $m$. During inference, we apply the diffusion model repeatedly until $t=0$.
  • Figure 3: Comparison of text and shape guided inpainting.
  • Figure 4: We ask users to choose the generation that best aligns with the mask and input text, and looks most realistic. Our method SmartBrush outperforms the baselines by a large margin.
  • Figure 5: Mask precision control samples with prompt "astronaut". As we increase the mask type (from accurate toward coarse), our method gives more freedom to the model and the outputs gradually diverge from the input object shape mask.
  • ...and 2 more figures
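
The training-input construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `coarsen_mask` and `noise_masked_region` are hypothetical, repeated 3x3 dilation is only an assumed stand-in for whatever transform the paper uses to turn the accurate mask $m$ into a coarser $m_s$, and the noising step is the standard DDPM forward process (with an assumed cumulative schedule value `alpha_bar_t`) applied only inside the mask so the background stays visible to the model.

```python
import numpy as np

def coarsen_mask(mask, level):
    """Hypothetical precision control: dilate a binary HxW mask `level` times
    with a 3x3 window. Level 0 returns the accurate mask unchanged."""
    h, w = mask.shape
    m = mask.astype(bool)
    for _ in range(level):
        padded = np.pad(m, 1)  # pad with False on every side
        m = np.zeros_like(m)
        # OR together the nine shifted views of the padded mask
        for dy in range(3):
            for dx in range(3):
                m |= padded[dy:dy + h, dx:dx + w]
    return m.astype(np.float32)

def noise_masked_region(x0, m_s, alpha_bar_t, rng):
    """Apply DDPM-style forward noising only where m_s == 1.
    x0: HxWxC image, m_s: HxW mask, alpha_bar_t: assumed schedule value in (0, 1)."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    x_noisy = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    m = m_s[..., None]  # broadcast the mask over channels
    x_t = m * x_noisy + (1.0 - m) * x0  # background pixels are left untouched
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.random((8, 8, 3)).astype(np.float32)  # toy image
m = np.zeros((8, 8), dtype=np.float32)
m[3:5, 3:5] = 1.0                              # accurate object mask
m_s = coarsen_mask(m, level=1)                 # coarser version of the mask
x_t, eps = noise_masked_region(x0, m_s, alpha_bar_t=0.5, rng=rng)
# Under this sketch, a model would receive (x_t, m_s, text) and be trained
# to predict both the added noise `eps` and the accurate mask `m`.
```

Predicting the accurate mask $m$ alongside the noise is what the abstract refers to as object-mask prediction for better background preservation; the loss terms themselves are not given in this summary and are left out of the sketch.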