Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei

TL;DR

This paper proposes to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features.

Abstract

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial in two respects: 1) Solely relying on one single U-Net to align the text prompt and the visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
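
The cascade described in the abstract can be pictured as a plain function composition. The following is only a minimal sketch of the data flow, not the authors' implementation: the function `cascaded_inpaint`, its parameter names, and the three toy stand-ins are all hypothetical, and real components would be a pre-trained image encoder, a Transformer-based semantic inpainter, and a diffusion model with a reference adapter layer.

```python
from typing import Callable, List

Image = List[List[float]]      # toy single-channel image
Mask = List[List[int]]         # 1 marks the region to inpaint
Features = List[List[float]]   # one feature vector per image position


def cascaded_inpaint(
    masked_image: Image,
    mask: Mask,
    prompt: str,
    encode_image: Callable[[Image], Features],
    semantic_inpainter: Callable[[Features, str], Features],
    diffusion_inpainter: Callable[[Image, Mask, str, Features], Image],
) -> Image:
    """Hypothetical wiring of the two cascaded stages."""
    # Extract visual features of the masked image.
    visual = encode_image(masked_image)
    # Stage 1: semantic pre-inpainting conditioned on context and prompt.
    semantic = semantic_inpainter(visual, prompt)
    # Stage 2: generation steered by the pre-inpainted semantic features.
    return diffusion_inpainter(masked_image, mask, prompt, semantic)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_encoder(image: Image) -> Features:
    return [[float(v)] for row in image for v in row]


def toy_semantic_inpainter(features: Features, prompt: str) -> Features:
    fill = float(len(prompt))  # pretend the prompt determines the feature
    return [[fill] if f == [0.0] else f for f in features]


def toy_diffusion(image: Image, mask: Mask, prompt: str, semantic: Features) -> Image:
    width = len(image[0])
    # Only masked pixels are rewritten; unmasked context is kept.
    return [
        [semantic[r * width + c][0] if mask[r][c] else image[r][c]
         for c in range(width)]
        for r in range(len(image))
    ]


result = cascaded_inpaint(
    [[1.0, 0.0], [2.0, 0.0]],
    [[0, 1], [0, 1]],
    "cat",
    toy_encoder,
    toy_semantic_inpainter,
    toy_diffusion,
)
```

Passing the stages in as callables keeps the sketch agnostic about how each model is built; the only point it illustrates is that the second stage receives the output of the first as an extra conditioning input.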

Paper Structure (28 sections, 10 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: An illustration of the conventional object inpainting framework and our proposed CAT-Diffusion. (a) A typical framework feeds a masked image with the original object removed, a text prompt describing the target object (e.g., "cat") and a binary mask indicating the designated region to be inpainted into a single diffusion model for object inpainting. (b) Our proposed CAT-Diffusion additionally pre-inpaints the semantic features of the target object via a Transformer-based semantic inpainter, and the derived features are leveraged to steer the diffusion model through the reference adapter layer for controllable and high-fidelity object inpainting.
  • Figure 2: The framework of our CAT-Diffusion. Specifically, a pre-trained image encoder is first employed to extract the visual features of the masked image. Then, a novel semantic inpainter takes these visual features and a text prompt as inputs, and pre-inpaints the semantic features of the desired object in a multi-modal feature space, thereby aligning the prompt and the visual object in addition to the U-Net regardless of denoising timesteps. To achieve this goal, knowledge distillation is adopted to transfer the multi-modal knowledge from a teacher model to the semantic inpainter. Finally, an object inpainting diffusion model equipped with a reference adapter layer is steered by the aligned semantic features for controllable object inpainting in visual space.
  • Figure 3: Examples generated by Stable Diffusion, Stable Diffusion Inpainting, GLIDE, Blended Diffusion, Blended Latent Diffusion and our proposed CAT-Diffusion with segmentation mask or bounding box mask.
  • Figure 4: (a) The distributions of cosine similarity between the semantic features before/after the semantic inpainter within the masked region and the corresponding ground-truth ones. (b) Diverse inpainted images generated by our proposed CAT-Diffusion using different random seeds.
  • Figure 5: Comparisons with SmartBrush.
  • ...and 1 more figure
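
The comparison in Figure 4(a) rests on cosine similarity between feature vectors inside the masked region. A minimal sketch of that measurement follows, under the assumption that features are stored as one vector per image position with a parallel boolean flag marking masked positions; the helper names `cosine_similarity` and `masked_similarities` are hypothetical and not taken from the paper's code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def masked_similarities(
    features: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    masked: Sequence[bool],
) -> List[float]:
    """Similarity to the ground-truth feature at each masked position only."""
    return [
        cosine_similarity(f, t)
        for f, t, m in zip(features, targets, masked)
        if m
    ]


sims = masked_similarities(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]],
    [True, False, True],
)
```

Calling `masked_similarities` once with the features before the semantic inpainter and once with the features after it would give the two sets of values whose distributions a plot like Figure 4(a) compares.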