IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement

Zhihao Shi, Dong Huo, Yuhongze Zhou, Kejia Yin, Yan Min, Juwei Lu, Xinxin Zuo

TL;DR

The paper proposes a robust 3D inpainting pipeline that combines geometric priors with a multi-view refinement network trained via test-time adaptation. It also develops an inpainting mask detection technique that derives targeted inpainting masks from object masks, improving performance on unconstrained scenes.

Abstract

Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, "unconstrained" scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.

Paper Structure

This paper contains 22 sections, 2 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: We present a novel 3D inpainting method capable of removing target objects from 3D scenes and seamlessly inpainting the occluded regions across various scenarios. The proposed method ensures the generation of plausible geometry and consistent textures. We highly recommend viewing the video results on our project page: https://xinxinzuo2353.github.io/imfine.
  • Figure 2: Overview of our proposed approach. (a) shows the proposed 3D inpainting pipeline. Specifically, a reference view is first inpainted and then used to reconstruct the complete geometry, along with rendered multi-view depth maps. The inpainted reference image is then warped to other views, and the warped images are taken as input conditions for the multi-view refinement network, which generates a set of view-consistent images. Afterward, we fine-tune the 3D scene to obtain the inpainted scene. (b) illustrates the detailed structure and training of the multi-view refinement network. We take the pre-trained image inpainting model, incorporate additional space-time attention layers, and use test-time adaptation to fine-tune it into a multi-view refinement model. To generate a fine-tuning dataset, we synthesize warped images from the originally captured scene, for which ground-truth captured images are available. In detail, reference views are selected randomly; we then set random masks over the images, apply geometry jittering, and warp the reference images to other views.
  • Figure 3: Illustration of the importance of multi-view refinement. The multi-view refinement model is proposed to tackle the warping artifacts with the example shown in (b), generating high-fidelity multi-view consistent images with one sampled view shown in (c).
  • Figure 4: Inpainting mask detection. (a) shows the object mask (orange) along with the dilated mask (pink), which includes additional neighboring pixels. Each dilated mask consists of two parts: the actual inpainting mask and out-of-interest areas, marked as regions ① and ② in (b). These regions are mapped in 3D to the NBS and background areas, respectively, shown as regions ① and ② in (c). During multi-view mapping optimization, the central NBS region is progressively refined, while the background areas are mutually suppressed. (d) shows the rendered neighbor mask after GS scene optimization.
  • Figure 5: Preview of our collected dataset. The training images are displayed at the top, and the testing images are at the bottom. More visual data about the dataset can be found in the supplement.
  • ...and 10 more figures
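
The Figure 2 caption describes warping the inpainted reference image to other views using the reconstructed depth. The following is a minimal sketch of one way such a depth-based forward warp could look, not the authors' implementation. The function name `warp_reference_to_target`, the shared pinhole intrinsics `K`, the 4x4 transform `T_ref_to_tgt`, and the nearest-pixel splatting with a simple depth test are all assumptions made for illustration.

```python
import numpy as np


def warp_reference_to_target(ref_img, ref_depth, K, T_ref_to_tgt):
    """Forward-warp a reference image into a target view (illustrative sketch).

    ref_img:       (H, W, C) image in the reference view.
    ref_depth:     (H, W) depth along the camera z-axis in the reference view.
    K:             (3, 3) pinhole intrinsics, assumed shared by both views.
    T_ref_to_tgt:  (4, 4) rigid transform from reference to target camera frame.
    Returns (warped, filled): the warped image and a boolean mask of the
    target pixels that received a colour.
    """
    H, W = ref_depth.shape
    C = ref_img.shape[-1]

    # Homogeneous pixel coordinates (u, v, 1) for every reference pixel.
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project to 3D in the reference frame, then move to the target frame.
    pts_ref = (np.linalg.inv(K) @ pix.T) * ref_depth.reshape(1, -1)
    pts_tgt = T_ref_to_tgt[:3, :3] @ pts_ref + T_ref_to_tgt[:3, 3:4]

    # Keep only points in front of the target camera.
    z = pts_tgt[2]
    idx = np.flatnonzero(z > 1e-6)

    # Project into the target image and round to the nearest pixel.
    proj = K @ pts_tgt[:, idx]
    ut = np.round(proj[0] / proj[2]).astype(np.int64)
    vt = np.round(proj[1] / proj[2]).astype(np.int64)
    inside = (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)
    idx, ut, vt = idx[inside], ut[inside], vt[inside]

    # Depth test: sort by depth so the first point per target pixel is nearest.
    order = np.argsort(z[idx], kind="stable")
    idx = idx[order]
    flat = (vt * W + ut)[order]
    tgt_flat, first = np.unique(flat, return_index=True)

    warped = np.zeros((H * W, C), dtype=ref_img.dtype)
    filled = np.zeros(H * W, dtype=bool)
    warped[tgt_flat] = ref_img.reshape(-1, C)[idx[first]]
    filled[tgt_flat] = True
    return warped.reshape(H, W, C), filled.reshape(H, W)
```

Target pixels where `filled` is false are holes left by disocclusion or by points that project outside the image. Holes and misplaced pixels of this kind are the sort of warping artifacts that, according to the Figure 3 caption, the multi-view refinement model is meant to correct.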