FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

Rupayan Mallick, Amr Abdalla, Sarah Adel Bargal

TL;DR

FaithFill addresses faithful object completion from a single reference image by integrating segmentation, NeRF-based view synthesis, and LoRA-finetuned diffusion inpainting. By producing multiple views of the reference object and constraining the inpainting updates, the method preserves shape, texture, color, and background while filling in missing regions. Evaluations on DreamBooth and a dedicated FaithFill dataset show improvements over state-of-the-art baselines on standard metrics, human judgments, and GPT-based assessments. The work also contributes the FaithFill dataset and demonstrates that data-efficient, faithful editing is possible with diffusion models.

Abstract

We present FaithFill, a diffusion-based inpainting approach to object completion that realistically generates missing object parts. Typically, multiple reference images are needed to achieve such realistic generation; otherwise, the generation would not faithfully preserve shape, texture, color, and background. In this work, we propose a pipeline that uses only a single input reference image, which may have varying lighting, background, object pose, and/or viewpoint. The single reference image is used to generate multiple views of the object to be inpainted. We demonstrate that FaithFill faithfully generates the object's missing parts while preserving the background/scene, from a single reference image. This is demonstrated through standard similarity metrics, human judgement, and GPT evaluation. Our results are presented on the DreamBooth dataset and a newly proposed dataset.

Paper Structure (20 sections, 3 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Inpainting results of different missing regions using FaithFill compared to some state-of-the-art techniques. While state-of-the-art techniques provide high-quality plausible inpainting results, they may not be faithful to the object. This is observed in both methods that do not use a reference image (Stable Inpainting and Blended Latent Diffusion), and methods that use a single reference/exemplar image (Paint By Example). The first row is a sample image from the DreamBooth dataset, and the second row is a sample image from our proposed FaithFill dataset. More qualitative examples for additional state-of-the-art techniques are presented later in the paper.
  • Figure 2: FaithFill Finetuning Pipeline. This figure presents a schematic overview of our finetuning pipeline. Given an input image $I_{ref}$, we generate $n$ different images $\{ x_{1}, x_{2}, \ldots, x_{n} \}$ from different viewpoints (VP) using a view generator based on NeRFs. The views $\{ x_{1}, x_{2}, \ldots, x_{n} \}$ are then multiplied with randomly generated masks $\{ m_{1}, m_{2}, \ldots, m_{n} \}$. The randomly masked views are used as input, along with the text, to the Inpainting Module. In this module we finetune the LoRA-adapted layers instead of finetuning the whole model. Finetuning is governed by a reconstruction loss with respect to the unmasked generated views.
  • Figure 3: AMT Interface (Left). This is a screenshot of the Amazon Mechanical Turk interface that we used to launch our user study. GPT-4o Setup (Right). We asked GPT-4o to compare two AI-generated images (ours vs. a baseline-generated image) and decide which one is more similar to the target image. It was prompted to provide the index of the selected image as well as the reason for the selection. This figure presents a sample result.
  • Figure 4: Qualitative Results. Inpainting results for different missing regions using FaithFill vs. state-of-the-art techniques. The first four rows are images from the DreamBooth dataset, and the next four rows are images from our FaithFill dataset. Unlike Figure 1, where we contrast state-of-the-art techniques using (or not using) a reference image, here we contrast results against the three state-of-the-art techniques that use exactly one reference image. We note that Stable Inpainting FT is an implementation of RealFill under a one-image configuration.
  • Figure 5: Human Judgement and GPT Evaluation Results. This figure shows the results of our user study (Left) on Amazon Mechanical Turk and GPT-4o study (Right). Every bar presents the percentage of times FaithFill generations were favored compared to the generations from state-of-the-art techniques.
  • ...and 1 more figures
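
The data-preparation step described in the Figure 2 caption (multiplying each generated view by a random mask and comparing a reconstruction against the unmasked view) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: views are represented as small 2D grids of floats, the masks are random rectangles with 0 marking the hidden region (so that multiplication removes it), and a plain mean squared error stands in for the reconstruction loss. The names `random_rect_mask`, `apply_mask`, `mse`, and `make_finetuning_pairs` are hypothetical.

```python
import random


def random_rect_mask(height, width, rng):
    """Binary mask: 0 inside a random rectangle (hidden), 1 elsewhere (kept).

    Assumption: rectangular masks; the caption only says masks are random.
    """
    h = rng.randint(1, height)
    w = rng.randint(1, width)
    top = rng.randint(0, height - h)
    left = rng.randint(0, width - w)
    return [
        [0 if (top <= r < top + h and left <= c < left + w) else 1
         for c in range(width)]
        for r in range(height)
    ]


def apply_mask(view, mask):
    """Element-wise product of a view x_i and its mask m_i."""
    return [[v * m for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(view, mask)]


def mse(pred, target):
    """Mean squared error, used here as a stand-in reconstruction loss."""
    total = sum((p - t) ** 2
                for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    return total / (len(target) * len(target[0]))


def make_finetuning_pairs(views, seed=0):
    """Pair every view with a random mask: (masked view, mask, unmasked target)."""
    rng = random.Random(seed)
    pairs = []
    for x in views:
        m = random_rect_mask(len(x), len(x[0]), rng)
        pairs.append((apply_mask(x, m), m, x))
    return pairs
```

In an actual finetuning loop, the masked view and mask would be passed to the inpainting model together with the text prompt, and only the LoRA-adapted layers would receive gradient updates; the sketch above covers just the masking and the comparison against the unmasked view.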