
RealFill: Reference-Driven Generation for Authentic Image Completion

Luming Tang, Nataniel Ruiz, Qinghao Chu, Yuanzhen Li, Aleksander Holynski, David E. Jacobs, Bharath Hariharan, Yael Pritch, Neal Wadhwa, Kfir Aberman, Michael Rubinstein

TL;DR

RealFill tackles Authentic Image Completion by finetuning a pretrained inpainting diffusion model on a small set of reference images and the target image, so that the model encodes the scene's content, lighting, and style. It then completes the missing regions via diffusion sampling and applies Correspondence-Based Seed Selection, which ranks the sampled outputs by their correspondences to the references. The authors introduce RealBench, a 33-scene dataset for inpainting and outpainting with ground truth, and show that RealFill significantly outperforms prompt-based and reference-based baselines across multiple similarity metrics. The approach yields faithful reconstructions even under large viewpoint and appearance changes, highlighting its potential for authentic scene restoration in practical photography contexts. Its limitations include slow training, failures when the geometric gap between the reference and target views is extreme, and difficulty with fine-grained details such as text or faces, which points to future work on speed and robustness.

Abstract

Recent advances in generative imagery have brought forth outpainting and inpainting models that can produce high-quality, plausible image content in unknown regions. However, the content these models hallucinate is necessarily inauthentic, since they are unaware of the true scene. In this work, we propose RealFill, a novel generative approach for image completion that fills in missing regions of an image with the content that should have been there. RealFill is a generative inpainting model that is personalized using only a few reference images of a scene. These reference images do not have to be aligned with the target image, and can be taken with drastically varying viewpoints, lighting conditions, camera apertures, or image styles. Once personalized, RealFill is able to complete a target image with visually compelling contents that are faithful to the original scene. We evaluate RealFill on a new image completion benchmark that covers a set of diverse and challenging scenarios, and find that it outperforms existing approaches by a large margin. Project page: https://realfill.github.io

Paper Structure

This paper contains 13 sections, 3 equations, 14 figures, and 3 tables.

Figures (14)

  • Figure 1: Training and inference pipelines of RealFill. RealFill's inputs are a target image to be filled and a few reference images of the same scene. We first finetune LoRA weights of a pretrained inpainting diffusion model on the reference and target images (with random patches masked out). Then, we use the adapted model to fill the desired region of the target image, resulting in a faithful, high-quality output. For example, the girl's crown is recovered in the target image, despite the girl being in very different poses in the reference images.
  • Figure 2: Reference-based outpainting with RealFill. Given the reference images on the left, RealFill outpaints the corresponding target images on the right. The region inside the white box is provided to the network as known pixels, and the region outside the white box is generated. RealFill produces high-quality images that are faithful to the references, even when there are dramatic differences between the references and targets such as changes in viewpoint, aperture, lighting, image style, and object motion.
  • Figure 3: Reference-based inpainting with RealFill. Given the references on the left, RealFill can not only remove undesired objects in the target image and reveal the occluded contents faithfully (left column), but also insert objects into the scene despite significant viewpoint changes between reference and target images (right column). In the bottom left example, the reference and target images have different defocus blurs. RealFill not only recovers the buildings behind the mug, but also keeps the same amount of blur as in the target image.
  • Figure 4: Qualitative comparison of RealFill and baselines. Transparent white masks are overlaid on the unaltered known regions of the target images. Paint-by-Example loses fidelity with the reference images because it relies on CLIP embeddings, which only capture high-level semantic information. TransFill outputs low-quality images due to the lack of a good image prior and the limitations of its geometry-based pipeline. While Generative Fill produces plausible results, they are inconsistent with the reference images because prompts have limited expressiveness. In contrast, RealFill generates high-quality results that have high fidelity with respect to the reference images.
  • Figure 5: Correspondence-based seed selection. Given the reference images on the left, we show multiple RealFill outputs on the right along with the number of matched key points. We can see that fewer matches correlate with lower-quality outputs that are more divergent from the ground-truth.
  • ...and 9 more figures
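
The seed selection idea in Figure 5 can be sketched as a simple ranking step: score each sampled output by how many keypoints it shares with the references, then prefer the outputs with the most matches. The snippet below is a minimal illustration and not the authors' implementation. The function name `rank_by_correspondence`, the `count_matches` callback, and the toy set-based "images" are all assumptions made for this example; in practice `count_matches` would wrap a real keypoint matcher, and summing counts over references is one possible aggregation choice.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")  # placeholder type; a real pipeline would pass image arrays


def rank_by_correspondence(
    candidates: Sequence[Image],
    references: Sequence[Image],
    count_matches: Callable[[Image, Image], int],
) -> List[Tuple[int, int]]:
    """Return (candidate_index, total_matches) pairs, best candidate first.

    count_matches is a hypothetical callback that returns the number of
    matched keypoints between one candidate output and one reference image.
    """
    scored = []
    for index, candidate in enumerate(candidates):
        # Assumption: aggregate by summing match counts over all references.
        total = sum(count_matches(candidate, ref) for ref in references)
        scored.append((index, total))
    # More matches rank higher; ties keep the original sampling order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def toy_count_matches(a: frozenset, b: frozenset) -> int:
    """Toy stand-in for a keypoint matcher: images are sets of descriptor ids."""
    return len(a & b)


# Toy data: each "image" is a set of descriptor ids (illustrative only).
references = [frozenset("abc"), frozenset("bcd")]
candidates = [frozenset("ab"), frozenset("abcd"), frozenset("z")]

ranking = rank_by_correspondence(candidates, references, toy_count_matches)
print(ranking)  # [(1, 6), (0, 3), (2, 0)]
```

Passing the matcher in as a callback keeps the ranking logic independent of any particular correspondence method, so the same function works with whichever matcher is available.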