BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu

TL;DR

BrushNet tackles semantic incoherence in diffusion-based image inpainting by adding a dedicated masked-image guidance branch that decouples masked features from the generative UNet. It is plug-and-play with any pretrained diffusion model, employing a VAE-encoded masked image, layer-wise feature injection, and blurred mask blending to preserve unmasked regions. The authors introduce BrushData and BrushBench to enable segmentation-based training and evaluation and demonstrate state-of-the-art performance on a suite of metrics including image quality, region preservation, and text alignment across random and segmentation-based masks. Limitations include dependence on base models and challenges with irregular masks and misaligned prompts, with planned future work and attention to ethical considerations.
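
The blurred mask blending step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the box filter, the `radius` parameter, the function names, and the mask convention (1 = region to inpaint, 0 = region to keep) are all assumptions chosen for illustration.

```python
import numpy as np

def box_blur(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Blur a 2-D mask with a (2*radius+1)^2 box filter (assumed blur kernel)."""
    k = 2 * radius + 1
    # Edge padding keeps the output the same size as the input mask.
    padded = np.pad(mask.astype(np.float64), radius, mode="edge")
    out = np.zeros(mask.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out / (k * k)

def blend(generated: np.ndarray, original: np.ndarray,
          mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Blend two HxWxC images with a softened mask.

    Assumed convention: mask == 1 marks pixels to inpaint, mask == 0 marks
    pixels to keep from the original image.
    """
    soft = box_blur(mask, radius)[..., None]  # HxWx1, values in [0, 1]
    return soft * generated + (1.0 - soft) * original
```

Far from the mask border the output equals the original image exactly, and deep inside the mask it equals the generated image. Within `radius` pixels of the border the two are mixed, which softens the seam at the cost of exact preservation in that narrow band.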

Abstract

Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.

Paper Structure (25 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Performance comparisons of BrushNet and previous image inpainting methods across various inpainting tasks: (I) Random Mask (< 50% masked), (II) Random Mask (> 50% masked), (III) Segmentation Mask Inside-Inpainting, (IV) Segmentation Mask Outside-Inpainting. Each group of results contains an artificial image (left) and a natural image (right) with six inpainting methods: (b) Blended Latent Diffusion (BLD) [avrahami2023blended], (c) Stable Diffusion Inpainting (SDI) [Rombach_2022_CVPR], (d) HD-Painter (HDP) [manukyan2023hd], (e) PowerPaint (PP) [zhuang2023task], (f) ControlNet-Inpainting (CNI) [zhang2023adding], and (g) Ours.
  • Figure 2: Comparison of previous inpainting architectures and BrushNet.
  • Figure 3: Model overview. Our model outputs an inpainted image given the mask and the masked image as input. First, we downsample the mask to match the size of the latent and feed the masked image to the VAE encoder so that it is aligned with the latent-space distribution. Then the noisy latent, the masked image latent, and the downsampled mask are concatenated as the input of BrushNet. The features extracted by BrushNet are added to the pretrained UNet layer by layer after a zero convolution block [zhang2023adding]. After denoising, the generated image and the masked image are blended with a blurred mask.
  • Figure 4: Benchmark overview. I and II show natural and artificial images, respectively, with the masks and captions of BrushBench. (a) to (d) show images of humans, animals, indoor scenarios, and outdoor scenarios. Each group of images shows the original image, the inside-inpainting mask, and the outside-inpainting mask, with the image caption on top. III shows images, masks, and captions from EditBench [wang2023imagen], with (e) for generated images and (f) for natural images. The images are randomly selected from both benchmarks.
  • Figure 5: Comparison of previous inpainting methods and BrushNet across various image domains. A detailed explanation of the compared methods is given in Fig. 1.
  • ...and 2 more figures
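
The input assembly and layer-wise injection described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the 4-channel latents, the downsampling factor of 8, the strided nearest-neighbour downsampling, and all function names are hypothetical. The zero convolution is modelled as a 1x1 convolution whose weight and bias start at zero, as in ControlNet.

```python
import numpy as np

def downsample_mask(mask: np.ndarray, factor: int = 8) -> np.ndarray:
    """Nearest-neighbour downsample of an HxW mask by striding (assumed method)."""
    return mask[::factor, ::factor]

def assemble_branch_input(noisy_latent: np.ndarray, masked_latent: np.ndarray,
                          mask: np.ndarray, factor: int = 8) -> np.ndarray:
    """Concatenate noisy latent, masked-image latent, and mask along channels.

    noisy_latent, masked_latent: (C, h, w); mask: (H, W) with H = h * factor.
    Returns an array of shape (2 * C + 1, h, w).
    """
    small = downsample_mask(mask, factor)[None].astype(noisy_latent.dtype)
    if small.shape[1:] != noisy_latent.shape[1:]:
        raise ValueError("downsampled mask does not match latent size")
    return np.concatenate([noisy_latent, masked_latent, small], axis=0)

def zero_conv(features: np.ndarray, weight: np.ndarray,
              bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: weight is (C_out, C_in), features are (C_in, h, w)."""
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

def inject(unet_features: np.ndarray, branch_features: np.ndarray,
           weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Add the projected branch features to one UNet layer's features."""
    return unet_features + zero_conv(branch_features, weight, bias)

# Usage with assumed shapes: 4-channel latents at 8x8 for a 64x64 mask.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 8, 8))
masked = rng.standard_normal((4, 8, 8))
mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0
branch_in = assemble_branch_input(noisy, masked, mask)  # shape (9, 8, 8)

# With zero-initialised weight and bias, injection leaves the UNet untouched.
unet_feat = rng.standard_normal((4, 8, 8))
w0, b0 = np.zeros((4, 9)), np.zeros(4)
unchanged = inject(unet_feat, branch_in, w0, b0)
```

Because the projection starts at zero, the combined model initially reproduces the pretrained UNet's features, and the added branch only begins to influence the output as its weights are trained.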