HOMER: Homography-Based Efficient Multi-view 3D Object Removal

Jingcheng Ni, Weiguang Zhao, Daniel Wang, Ziyao Zeng, Chenyu You, Alex Wong, Kaizhu Huang

TL;DR

HOMER tackles efficient, consistent multi-view 3D object removal by eliminating the need for camera poses and extra model training during mask generation. It introduces a region-based interaction workflow and a lightweight Homography-based Mask Matching (HoMM) module that propagates removal masks across views via homographies, complemented by selective inpainting on key views. The method is compatible with diverse radiance-field representations (e.g., NeRF, 3D Gaussian Splatting). The paper also introduces a new multi-object open-scene dataset and reports state-of-the-art performance while reducing runtime to about one-fifth of that of leading baselines. By decoupling mask generation from 3D model training and relying on geometric propagation, HOMER achieves robust 3D object removal suitable for real-world applications in augmented reality and robotics. The results indicate practical gains in both efficiency and generalizability across scene configurations and radiance-field backbones.

Abstract

3D object removal is an important sub-task in 3D scene editing, with broad applications in scene understanding, augmented reality, and robotics. However, existing methods struggle to achieve a desirable balance among consistency, usability, and computational efficiency in multi-view settings. These limitations are primarily due to unintuitive user interaction in the source view, inefficient multi-view object mask generation, computationally expensive inpainting procedures, and a lack of applicability across different radiance field representations. To address these challenges, we propose a novel pipeline that improves the quality and efficiency of multi-view object mask generation and inpainting. Our method introduces an intuitive region-based interaction mechanism in the source view and eliminates the need for camera poses or extra model training. Our lightweight HoMM module is employed to achieve high-quality multi-view mask propagation with enhanced efficiency. In the inpainting stage, we further reduce computational costs by performing inpainting only on selected key views and propagating the results to other views via homography-based mapping. Our pipeline is compatible with a variety of radiance field frameworks, including NeRF and 3D Gaussian Splatting, demonstrating improved generalizability and practicality in real-world scenarios. Additionally, we present a new 3D multi-object removal dataset with greater object diversity and viewpoint variation than existing datasets. Experiments on public benchmarks and our proposed dataset show that our method achieves state-of-the-art performance while reducing runtime to one-fifth of that required by leading baselines.
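
The homography-based propagation described above can be illustrated with a minimal sketch. It assumes a 3x3 homography `H` mapping source-view pixels to target-view pixels is already available, and it warps a binary mask by inverse mapping with nearest-neighbour sampling. The function names (`invert_3x3`, `apply_homography`, `warp_mask`) are hypothetical and chosen for illustration; this is not the authors' HoMM module, which also refines the warped masks.

```python
def invert_3x3(H):
    """Invert a 3x3 matrix given as nested lists (adjugate over determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def apply_homography(H, x, y):
    """Map pixel (x, y) through H; return None if the point goes to infinity."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    return xs / w, ys / w


def warp_mask(mask, H, out_h, out_w):
    """Warp a binary mask (list of rows) from the source view to a target view.

    H maps source pixels to target pixels. Each target pixel is mapped back
    through the inverse of H and takes the value of the nearest source pixel,
    which avoids holes in the output.
    """
    H_inv = invert_3x3(H)
    src_h, src_w = len(mask), len(mask[0])
    out = [[0] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            mapped = apply_homography(H_inv, u, v)
            if mapped is None:
                continue
            xi, yi = round(mapped[0]), round(mapped[1])
            if 0 <= xi < src_w and 0 <= yi < src_h:
                out[v][u] = mask[yi][xi]
    return out
```

The same inverse-mapping loop could copy inpainted colours from a key view into another view instead of mask values. A single homography is exact only for planar regions or pure camera rotation, so a warped mask of this kind is best read as a coarse estimate.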

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of our method with the baseline SPIn-NeRF (Mirzaei et al., 2023) for multi-view object mask generation (top) and inpainting & reconstruction (bottom). The turtle icons indicate computationally intensive steps in the SPIn-NeRF pipeline, while rabbit icons highlight the efficiency improvements in our approach.
  • Figure 2: Network Architecture. Our proposed HOMER consists of three main stages: (a) Interaction Processing, (b) Multi-View Object Removal, and (c) 3D Reconstruction. (a) In the interaction stage, the user selects a source view and specifies the regions to be removed and preserved. Binary masks are generated and sequentially inpainted by LaMa to produce the updated source image. (b) In the multi-view removal stage, we compute homography matrices and warp the initial masks to other views to generate coarse masks. An adaptive anchor circle adjustment module is then used to generate final object masks for each view. Refined masks and inpainted results are propagated from the key views to other views with homography-based mapping. (c) The inpainted multi-view images and corresponding camera poses are fed into a radiance field model to reconstruct the 3D scene with objects removed.
  • Figure 3: NeRF renderings for our dataset. $w$ stands for "with". We show the rendering of the first test pose for 4 sampled scenes in our dataset. We compare against NeRFiller using DINO masks, NeRFiller using our generated masks (second row), and our method (third row).
  • Figure 4: Interaction Comparison
  • Figure 5: NeRF renderings for the SPIn-NeRF dataset. $w$ stands for "with" and GT denotes ground truth. We show the rendering of the first test pose for 4 sampled scenes in the SPIn-NeRF dataset. We compare our method against NeRFiller using ground-truth masks and NeRFiller using our generated masks (third row).
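
The Figure 2 caption mentions computing homography matrices between views but does not say how they are estimated. As a hedged illustration only, the sketch below uses the textbook four-point formulation: each correspondence (x, y) -> (u, v) contributes two linear equations, the bottom-right entry of H is fixed to 1, and the resulting 8x8 system is solved by Gaussian elimination. The names `solve_linear` and `estimate_homography` are hypothetical, and the correspondences are assumed to come from some external feature matcher.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def estimate_homography(src_pts, dst_pts):
    """Estimate H (3x3, with H[2][2] = 1) from exactly four correspondences.

    src_pts and dst_pts are lists of (x, y) tuples; no three points in
    either list may be collinear.
    """
    if len(src_pts) != 4 or len(dst_pts) != 4:
        raise ValueError("exactly four correspondences are required")
    A, b = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(float(u))
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(float(v))
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]
```

With more than four noisy matches, a least-squares or RANSAC-based estimator would normally replace the exact solve; the four-point case is shown here only because it makes the underlying linear system explicit.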