HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models

Hayk Manukyan; Andranik Sargsyan; Barsegh Atanyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi

TL;DR

The paper designs the Prompt-Aware Introverted Attention (PAIntA) layer, which enhances self-attention scores with prompt information to produce better text-aligned generations, and introduces the Reweighting Attention Score Guidance (RASG) mechanism, which seamlessly integrates a post-hoc sampling strategy into the general form of DDIM.

Abstract

Recent progress in text-guided image inpainting, based on the unprecedented success of text-to-image diffusion models, has led to exceptionally realistic and visually plausible results. However, there is still significant potential for improvement in current text-to-image inpainting models, particularly in better aligning the inpainted area with user prompts and performing high-resolution inpainting. Therefore, we introduce HD-Painter, a training-free approach that accurately follows prompts and coherently scales to high-resolution image inpainting. To this end, we design the Prompt-Aware Introverted Attention (PAIntA) layer, which enhances self-attention scores with prompt information, resulting in better text-aligned generations. To further improve prompt coherence, we introduce the Reweighting Attention Score Guidance (RASG) mechanism, which seamlessly integrates a post-hoc sampling strategy into the general form of DDIM to prevent out-of-distribution latent shifts. Moreover, HD-Painter allows extension to larger scales by introducing a specialized super-resolution technique customized for inpainting, enabling the completion of missing regions in images of up to 2K resolution. Our experiments demonstrate that HD-Painter surpasses existing state-of-the-art approaches quantitatively and qualitatively across multiple metrics and a user study. Code is publicly available at: https://github.com/Picsart-AI-Research/HD-Painter

Paper Structure (27 sections, 16 equations, 22 figures, 2 tables)

Introduction
Related Work
Image Inpainting
Inpainting-Specific Architectural Blocks
Post-Hoc Guidance in Diffusion Process
Method
Stable Diffusion and Stable Inpainting
HD-Painter: Overview
Prompt-Aware Introverted Attention (PAIntA)
Reweighting Attention Score Guidance (RASG)
Inpainting-Specialized Conditional Super-Resolution
Experiments
User Study
Implementation Details
Experimental Setup
...and 12 more sections

Figures (22)

Figure 1: High-resolution (the large side is $2048$ in all these examples) text-guided image inpainting results with our approach. The method is able to faithfully fill the masked region according to the prompt even if the combination of the prompt and the known region is highly unlikely. Zoom in to view high-resolution details.
Figure 2: Our method has two stages: image completion and inpainting-specialized super-resolution ($\times 4$). For image completion, in each diffusion step we denoise the latent $x_t$ by conditioning on the inpainting mask $M$ and the masked downscaled image $I^M = down(I)\odot(1-M)\in\mathbb{R}^{\frac{H}{4}\times \frac{W}{4}\times 3}$ (encoded with the VAE encoder $\mathcal{E}$). To better align with the given prompt, our PAIntA block is applied instead of self-attention layers. After predicting the denoised $x^{pred}_0$ in each step $t$, we provide it to our RASG guidance mechanism to estimate the next latent $x_{t-1}$. For inpainting-specific super-resolution, we condition the high-resolution latent $X_t$ denoising process on the lower-resolution inpainted result $I^c_{low}$, followed by blending $X^{pred}_0\odot M + \mathcal{E}(I)\odot(1-M)$. Finally, we get $I^c$ by Poisson blending the decoded output with the original image $I$.
Figure 3: (a) The PAIntA block takes an input tensor $X\in\mathbb{R}^{h\times w\times 3}$ and the CLIP embeddings of $\tau$. After computing the self- and cross-attention scores $A_{self}$ and $A_{cross}$, we update the former (Eq. \ref{eq:self_attn_update}) by scaling with the normalized values $\{c_j\}_{j=1}^{hw}$ obtained from $S_{cross} = SoftMax(A_{cross})$. Finally, the updated attention scores $\tilde{A}_{self}$ are used for the convex combination of the values $V_s$ to get the residual of PAIntA's output. (b) The RASG mechanism takes the predicted scaled denoised latent $\sqrt{\alpha_{t-1}}x^{pred}_0 = \frac{\sqrt{\alpha_{t-1}}}{\sqrt{\alpha_t}}\left(x_t - \sqrt{1-\alpha_t}\epsilon_{\theta}(x_t)\right)$ and guides the $x_{t-1}$ estimation process towards minimization of $S(x_t)$ defined by Eq. \ref{eq:defining_function_S}. Gradient reweighting makes the gradient term close to being sampled from $\mathcal{N}(\textbf{0},\textbf{1})$ (green area), thereby ensuring domain preservation (blue area).
Figure 4: Comparison with state-of-the-art text-guided inpainting methods. Zoom in for details. For more comparison see Appendix.
Figure 5: Total votes of each method based on our user study for prompt alignment and overall quality. Our method HD-Painter has a clear advantage over all competitors.
...and 17 more figures
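
The two closed-form expressions quoted in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: flat Python lists stand in for latent tensors, the function names (`scaled_x0_pred`, `masked_blend`, `mask_known_region`) are hypothetical, and $\alpha_t$ is assumed to be a scalar in $(0, 1]$. The noise prediction $\epsilon_{\theta}(x_t)$ is passed in as a plain list rather than computed by a network, and the RASG guidance term itself is not modelled here.

```python
import math


def scaled_x0_pred(x_t, eps, alpha_t, alpha_prev):
    """Return sqrt(alpha_{t-1}) * x0_pred, elementwise.

    Follows the Figure 3 caption:
    sqrt(alpha_{t-1}) / sqrt(alpha_t) * (x_t - sqrt(1 - alpha_t) * eps).
    """
    scale = math.sqrt(alpha_prev) / math.sqrt(alpha_t)
    noise_scale = math.sqrt(1.0 - alpha_t)
    return [scale * (x - noise_scale * e) for x, e in zip(x_t, eps)]


def masked_blend(pred, known, mask):
    """Return pred * M + known * (1 - M), elementwise (Figure 2 blending).

    mask is 1 inside the region to inpaint and 0 in the known region.
    """
    return [p * m + k * (1.0 - m) for p, k, m in zip(pred, known, mask)]


def mask_known_region(image, mask):
    """Return image * (1 - M): zero out the region to be inpainted."""
    return [v * (1.0 - m) for v, m in zip(image, mask)]


if __name__ == "__main__":
    # Toy values, chosen only for illustration.
    x0 = [0.5, -1.0, 2.0]
    eps = [0.1, 0.2, -0.3]
    alpha_t = 0.64
    # Forward-noise x0, then check that the formula undoes the noising
    # when the noise is known exactly and alpha_{t-1} = 1.
    x_t = [math.sqrt(alpha_t) * a + math.sqrt(1.0 - alpha_t) * e
           for a, e in zip(x0, eps)]
    print(scaled_x0_pred(x_t, eps, alpha_t, alpha_prev=1.0))
    print(masked_blend([1.0, 2.0, 3.0], [9.0, 8.0, 7.0], [1, 0, 1]))
```

With an exact noise estimate and $\alpha_{t-1} = 1$, the first function recovers the clean latent up to floating-point error, which is a quick sanity check on the formula. The blend keeps predicted values only where the mask is 1, so the known region is carried over from the encoded original unchanged.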
