MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

Chenjie Cao; Chaohui Yu; Fan Wang; Xiangyang Xue; Yanwei Fu

TL;DR

MVInpainter inpaints the masked regions of multi-view images under reference guidance rather than generating an entirely novel view from scratch, which greatly simplifies in-the-wild NVS and exploits unmasked clues instead of explicit pose conditions.

Abstract

Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, which hinders generalization to challenging in-the-wild scenes and prevents direct use with 2D synthesis. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, which reformulates 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter inpaints the masked regions of multi-view images under reference guidance rather than generating an entirely novel view from scratch, which greatly simplifies in-the-wild NVS and exploits unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and by appearance guidance from concatenated reference key and value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions, which controls the camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter across diverse tasks, such as multi-view object removal, synthesis, insertion, and replacement. The project page is https://ewrfcas.github.io/MVInpainter/.
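
The "concatenated reference key and value attention" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`softmax`, `attention`, `ref_kv_attention`) are hypothetical, the learned query/key/value projections are assumed to be identity maps, and a single attention head over plain Python lists stands in for the spatial self-attention blocks of the denoising U-Net. The only idea it shows is that each target view attends over its own tokens plus the tokens of the reference view.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of token vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return outputs


def ref_kv_attention(view_tokens, ref_tokens):
    """Attend from one target view to its own tokens plus the reference tokens.

    Assumption: projections are identity, so tokens serve directly as
    queries, keys, and values. Keys and values are the concatenation of
    the target-view tokens and the reference-view tokens.
    """
    keys = view_tokens + ref_tokens
    values = view_tokens + ref_tokens
    return attention(view_tokens, keys, values)


if __name__ == "__main__":
    # One token in the target view, one token in the reference view.
    print(ref_kv_attention([[0.0, 0.0]], [[10.0, 0.0]]))
```

Because the queries come only from the target view, the output has one vector per target token, while the reference tokens influence the result solely through the extended key and value set.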

Paper Structure (27 sections, 4 equations, 22 figures, 8 tables)

Introduction
Related Work
Approach
MVInpainter Tasks
Multi-View Consistent Inpainting Model
Pose-Free Flow Grouping
Inference
Experiments
Object-Centric Results
Forward-Facing Results
Real-World 3D Scene Editing
Ablation Study
Conclusion
Supplementary Results
More Details and Comparison
...and 12 more sections

Figures (22)

Figure 1: MVInpainter addresses 2D/3D editing tasks through its multi-view consistent inpainting ability: (a) novel view synthesis, (b) multi-view object removal, and (c) object insertion and replacement. Given one inpainted or edited reference image, MVInpainter propagates it to other masked views without pose conditions. (d) With consistent generation, MVInpainter can be applied to real-world 3D scene editing using dense point clouds from Dust3R [dust3r_cvpr24] or Multi-View Stereo (MVS) [cao2024mvsformer++] together with 3DGS [kerbl3Dgaussians].
Figure 2: The overall pipeline and main contributions of MVInpainter. We primarily focus on multi-view inpainting, while the 3D reconstruction is detailed in Appendix Sec. \ref{sec:3d_editing}.
Figure 3: (a) Overview of the proposed MVInpainter. MVInpainter-O is trained on object-centric data, while MVInpainter-F is trained on forward-facing data; both share an SD-inpainting backbone but use different LoRA/motion weights and masking strategies. The object-centric MVInpainter focuses on object-level NVS, while the forward-facing one is devoted to object removal and scene-level inpainting. (b) The Ref-KV is used in the spatial self-attention blocks of the denoising U-Net. (c) The slot-attention-based flow grouping module is used to learn implicit pose features. Dashed boxes in (b) and (c) denote feature concatenation.
Figure 4: (a) The inference pipeline includes object removal, mask adaptation, and object insertion. (b) Illustration of the heuristic mask adaptation, which is built from the yellow points of the closed convex hull. (c) The perspective warping based on the basic plane and the bottom face. All matches lie on the basic plane, filtered by Grounded-SAM [ren2024grounded] with the captions "table" and "tablecloth".
Figure 5: Object-centric results on CO3D, MVImgNet, and Omni3D. The first row shows the reference (first column) and the other masked inputs, while the remaining rows show results sampled from LeftRefill [cao2024leftrefill], Nerfiller [weber2024nerfiller], ZeroNVS [sargent2023zeronvs], and our MVInpainter. Please zoom in for details.
...and 17 more figures
