CorrFill: Enhancing Faithfulness in Reference-based Inpainting with Correspondence Guidance in Diffusion Models

Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu-Lun Liu, Yen-Yu Lin

TL;DR

CorrFill tackles faithful reference-based inpainting by introducing a training-free module that imposes explicit correspondence constraints between a reference and a damaged target using self-attention in diffusion models. It stitches the reference and target, derives correspondences from aggregated attention maps, and refines them in a cyclic loop while guiding denoising with attention masks and latent-tensor optimization. The method improves faithfulness to the reference across several baselines on RealEstate10K and MegaDepth, demonstrating substantial PSNR/SSIM gains and reduced artifacts, though it faces challenges with complex geometry and large viewpoint changes. This work offers a practical, training-free approach to improve the alignment between references and inpainted results, with potential for broader downstream diffusion-model controllability.

Abstract

In the task of reference-based image inpainting, an additional reference image is provided to restore a damaged target image to its original state. The advancement of diffusion models, particularly Stable Diffusion, allows for simple formulations in this task. However, existing diffusion-based methods often lack explicit constraints on the correlation between the reference and damaged images, resulting in lower faithfulness to the reference images in the inpainting results. In this work, we propose CorrFill, a training-free module designed to enhance the awareness of geometric correlations between the reference and target images. This enhancement is achieved by guiding the inpainting process with correspondence constraints estimated during inpainting, utilizing attention masking in self-attention layers and an objective function to update the input tensor according to the constraints. Experimental results demonstrate that CorrFill significantly enhances the performance of multiple baseline diffusion-based methods, including state-of-the-art approaches, by emphasizing faithfulness to the reference images.
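
As a rough illustration of how correspondences might be read off self-attention over a stitched image, the following NumPy sketch extracts the target-to-reference block of an attention matrix, keeps a running average of it as a matching map, and takes the most attended reference position for each target token. The function names, the left/right token layout, the running-average rule, and the `momentum` parameter are all assumptions made for this example; the abstract does not specify how the paper aggregates attention or computes correspondences.

```python
import numpy as np

def split_indices(h, w):
    """Flat token indices of the reference (left) and target (right) halves
    of an h x 2w stitched token grid (layout assumed for illustration)."""
    grid = np.arange(h * 2 * w).reshape(h, 2 * w)
    return grid[:, :w].ravel(), grid[:, w:].ravel()

def tar2ref_attention(attn, h, w):
    """Rows: target queries, columns: reference keys, renormalised per row."""
    ref_idx, tar_idx = split_indices(h, w)
    block = attn[np.ix_(tar_idx, ref_idx)]
    return block / block.sum(axis=1, keepdims=True)

def update_matching_map(matching, block, momentum=0.5):
    """Hypothetical aggregation rule: running average over denoising steps."""
    if matching is None:
        return block
    return momentum * matching + (1.0 - momentum) * block

def correspondences(matching, w):
    """Most attended reference (row, col) for every target token."""
    best = matching.argmax(axis=1)
    return np.stack([best // w, best % w], axis=1)

# Toy usage: a synthetic attention matrix with one planted peak per target token.
h, w = 2, 3
rng = np.random.default_rng(0)
attn = rng.random((h * 2 * w, h * 2 * w)) * 0.1
ref_idx, tar_idx = split_indices(h, w)
planted = np.array([2, 0, 1, 5, 3, 4])      # reference token chosen for each target token
attn[tar_idx, ref_idx[planted]] += 1.0

matching = None
for _ in range(2):                          # stand-in for two denoising steps
    matching = update_matching_map(matching, tar2ref_attention(attn, h, w))
pairs = correspondences(matching, w)        # (row, col) in the reference grid
```

In a real diffusion model the attention matrix would come from the self-attention layers of the U-Net rather than from random numbers, and several layers or heads would presumably be combined before the argmax.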

Paper Structure (25 sections, 5 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Overview. (a) The reference and target images are stitched side by side and serve as the input to the model. (b) Reference-based inpainting using an inpainting-fine-tuned Stable Diffusion [rombach2022LDM] with our training-free correspondence guidance. (c) Our method captures more reliable correlations between references and targets than previous methods [cao2024leftrefill], thereby avoiding incorrect geometry and unwanted objects.
  • Figure 2: Approach overview. CorrFill jointly guides the inpainting and refines the estimated correspondences at each denoising step. The noise tensor $N^\epsilon$, the downscaled mask $M^\epsilon$, and the encoded stitched image $\epsilon(I_{\text{ref;tar}})$ are concatenated into the input latent tensor $z_T$. At each denoising step, the self-attention scores from the diffusion model are aggregated into a matching map $C_t$, from which the correspondences $P_t$ are computed; $P_t$ then guides the subsequent denoising step. For visual clarity, real images are used to depict $\epsilon(I_{\text{ref;tar}})$ and $z_0$.
  • Figure 3: Correspondences in the early stage. The image on the left highlights the masked regions of the target and their most attended positions in the reference, indicated by colors, at the very first denoising step. The image on the right depicts a few correspondences computed at the first denoising step.
  • Figure 4: Correspondence guidance in the diffusion U-Net. At each denoising step $t$, the denoising process is guided by the correspondences estimated in the previous step, $P_{t+1}$, through attention masking with $m_t$ and optimizing $z_t$ using the objective function $S(\cdot)$. The generated attention maps $A_t^{\text{tar2ref}}$ are then employed to further refine the estimated correspondences $P_t$ by updating the matching map $C_t$.
  • Figure 5: Qualitative results. We compare four baselines with their counterparts integrated with our method on two datasets. Red boxes mark problematic regions in the baseline results that our approach effectively addresses. The inpainting masks are generated based on the content of the image pairs.
  • ...and 1 more figure
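
The attention masking mentioned in the Figure 4 caption can be pictured with a small NumPy sketch. It assumes, purely for illustration, that each target token inside the inpainting hole may attend only to reference tokens within a square window around its estimated correspondence, while keeping access to all target tokens. The window rule, the `radius` parameter, and both function names are hypothetical; the captions above do not define how $m_t$ is built, and the objective $S(\cdot)$ used to optimize $z_t$ is not sketched here.

```python
import numpy as np

def correspondence_attention_mask(corr, hole, h, w, radius=1):
    """Boolean (h*2w, h*2w) mask: True where a query may attend to a key.

    corr: (h*w, 2) int array, estimated reference (row, col) per target token
    hole: (h*w,) bool array, True for target tokens inside the inpainting mask
    Assumed layout: reference on the left half of the grid, target on the right.
    """
    n = h * 2 * w
    allow = np.ones((n, n), dtype=bool)
    grid = np.arange(n).reshape(h, 2 * w)
    ref_idx, tar_idx = grid[:, :w].ravel(), grid[:, w:].ravel()
    ref_rows, ref_cols = np.divmod(np.arange(h * w), w)
    for k in np.flatnonzero(hole):
        # Keep only reference keys close to the estimated correspondence.
        near = (np.abs(ref_rows - corr[k, 0]) <= radius) & \
               (np.abs(ref_cols - corr[k, 1]) <= radius)
        allow[tar_idx[k], ref_idx] = near
    return allow

def masked_softmax(logits, allow):
    """Row-wise softmax with disallowed keys forced to zero weight."""
    logits = np.where(allow, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy usage: identity correspondences, one hole token, window of a single cell.
h, w = 2, 3
corr = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]])
hole = np.array([False, True, False, False, False, False])
allow = correspondence_attention_mask(corr, hole, h, w, radius=0)
attn = masked_softmax(np.zeros((h * 2 * w, h * 2 * w)), allow)
```

Because every query keeps access to the target half, no row of the mask is ever fully blocked, so the softmax stays well defined.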