MVS-GS: High-Quality 3D Gaussian Splatting Mapping via Online Multi-View Stereo

Byeonggwon Lee, Junkyu Park, Khang Truong Giang, Sungho Jo, Soohwan Song

TL;DR

This paper tackles online, high-fidelity 3D model generation for neural rendering from RGB streams, addressing depth ambiguities that degrade renderings. It introduces a two-branch pipeline: a frontend that performs camera tracking and online MVS depth estimation with MVSFormer over a local time window, and a backend that densifies and optimizes 3D Gaussian splats (3DGS) in parallel, using filtered depths from sequential views to initialize Gaussian points. The approach includes a depth refinement step via V-Fuse and a PSNR-based detection of unexplored regions to efficiently densify the scene, with adaptive density control through differentiable rendering. Experiments on indoor (Replica, TUM-RGBD) and outdoor (Tanks and Temples, aerial) datasets show that the method outperforms state-of-the-art dense SLAM methods, with robust outdoor performance and detailed reconstructions.
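To make the PSNR-based detection of unexplored regions more concrete, the sketch below flags image tiles where the current render disagrees strongly with the observed keyframe. It is only an illustration: the tile-wise granularity, the 20 dB threshold, the grayscale list-of-rows image format, and the function names `psnr` and `unexplored_tiles` are all assumptions, not details taken from the paper.

```python
import math


def psnr(rendered, observed, max_val=1.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    sq_err = 0.0
    count = 0
    for row_r, row_o in zip(rendered, observed):
        for r, o in zip(row_r, row_o):
            sq_err += (r - o) ** 2
            count += 1
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def unexplored_tiles(rendered, observed, tile=2, threshold_db=20.0):
    """Return (tile_row, tile_col) indices whose PSNR is below threshold_db.

    Assumption: a low PSNR means the current model explains this tile
    poorly, so it is a candidate region for adding new Gaussians.
    """
    height, width = len(observed), len(observed[0])
    flagged = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            r_tile = [row[tx:tx + tile] for row in rendered[ty:ty + tile]]
            o_tile = [row[tx:tx + tile] for row in observed[ty:ty + tile]]
            if psnr(r_tile, o_tile) < threshold_db:
                flagged.append((ty // tile, tx // tile))
    return flagged


if __name__ == "__main__":
    observed = [[0.5] * 4 for _ in range(4)]
    rendered = [row[:] for row in observed]
    # Simulate a region the model has not reconstructed yet (renders black).
    for v in (0, 1):
        for u in (2, 3):
            rendered[v][u] = 0.0
    print(unexplored_tiles(rendered, observed))
```

In a pipeline like the one described, only pixels inside the flagged tiles would need new Gaussian points, which keeps densification cheap in regions that already render well.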

Abstract

This study addresses the challenge of online 3D model generation for neural rendering using an RGB image stream. Previous research has tackled this issue by incorporating Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) as scene representations within dense SLAM methods. However, most studies focus primarily on estimating coarse 3D scenes rather than achieving detailed reconstructions. Moreover, depth estimation based solely on images is often ambiguous, resulting in low-quality 3D models that lead to inaccurate renderings. To overcome these limitations, we propose a novel framework for high-quality 3DGS modeling that leverages an online multi-view stereo (MVS) approach. Our method estimates MVS depth using sequential frames from a local time window and applies comprehensive depth refinement techniques to filter out outliers, enabling accurate initialization of Gaussians in 3DGS. Furthermore, we introduce a parallelized backend module that optimizes the 3DGS model efficiently, ensuring timely updates with each new keyframe. Experimental results demonstrate that our method outperforms state-of-the-art dense SLAM methods, particularly excelling in challenging outdoor environments.
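The initialization of Gaussians from filtered depth can be pictured as a back-projection of valid depth pixels into world space. The following is a minimal sketch under stated assumptions: a pinhole camera with intrinsics `fx, fy, cx, cy`, a 4x4 camera-to-world pose, and an isotropic initial scale of `depth / fx` (roughly one pixel's footprint at that depth). The function name `backproject_depth` and the scale heuristic are hypothetical and not taken from the paper.

```python
def backproject_depth(depth, valid, fx, fy, cx, cy, cam_to_world):
    """Lift valid depth pixels to world-space points with an initial scale.

    depth:        list of rows of depth values (same units as the pose)
    valid:        list of rows of booleans (True = depth passed filtering)
    cam_to_world: 4x4 pose as a list of 4 rows
    Returns a list of ((x, y, z), scale) tuples.
    """
    points = []
    for v, (d_row, m_row) in enumerate(zip(depth, valid)):
        for u, (d, ok) in enumerate(zip(d_row, m_row)):
            if not ok or d <= 0.0:
                continue  # skip filtered-out or missing depths
            # Pinhole back-projection into the camera frame.
            p_cam = ((u - cx) / fx * d, (v - cy) / fy * d, d, 1.0)
            # Rigid transform into the world frame.
            p_world = tuple(
                sum(cam_to_world[r][c] * p_cam[c] for c in range(4))
                for r in range(3)
            )
            # Assumed heuristic: size of one pixel at this depth.
            scale = d / fx
            points.append((p_world, scale))
    return points


if __name__ == "__main__":
    pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    depth = [[2.0, 2.0], [0.0, 2.0]]
    valid = [[True, False], [True, True]]
    for point, scale in backproject_depth(depth, valid, 1, 1, 0, 0, pose):
        print(point, scale)
```

Because outliers are removed before this step, each surviving pixel yields a Gaussian center that already lies near the true surface, which is the motivation the abstract gives for depth refinement.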

Paper Structure

This paper contains 11 sections, 3 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Rendering results of (a) Photo-SLAM [huang2024photo] and (b) our method. The optimized Gaussian points and the estimated point cloud for each method are shown in the upper right and lower right, respectively. Our method utilizes an online MVS approach to generate a dense point cloud, which is then used to initialize Gaussian points. As a result, our method produces a 3DGS model with much higher rendering quality than Photo-SLAM, which relies on a sparse point cloud.
  • Figure 2: System overview: Our system consists of a frontend and a backend, both running in parallel. The frontend initially estimates the camera pose of keyframes using SLAM. It then estimates the depth map and confidence map of each keyframe based on MVS, refining the depth map by incorporating depth information from sequential keyframes. The backend generates new Gaussian points from the refined depth map and integrates them into the 3DGS model. The backend then continuously optimizes the 3DGS model.
  • Figure 3: An illustration of the depth map refinement and filtering process: (b) the depth map and (c) the confidence map estimated by MVS are (d) refined using V-Fuse [burgdorfer2023v]. Our method then (e) filters outliers by checking the geometric consistency of the depths.
  • Figure 4: Qualitative evaluation on the Replica RGB dataset (first and second rows) and the TUM-RGBD dataset (third and fourth rows).
  • Figure 5: Rendering results of (a) Photo-SLAM [huang2024photo] and (b) our method on two aerial scenes. Each PSNR and SSIM value is the average over the entire scene.
  • ...and 1 more figure
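The geometric consistency filtering mentioned in the Figure 3 caption can be illustrated with a forward-backward reprojection check, a common practice in MVS pipelines. This is a sketch under assumptions, not the paper's exact procedure: the thresholds (`max_px`, `max_rel_depth`), the nearest-neighbor depth lookup, shared intrinsics `K = (fx, fy, cx, cy)`, and all function names are hypothetical.

```python
import math


def _transform(T, p):
    """Apply a 4x4 rigid transform (list of rows) to a 3D point."""
    return tuple(
        T[r][0] * p[0] + T[r][1] * p[1] + T[r][2] * p[2] + T[r][3]
        for r in range(3)
    )


def _lift(u, v, d, K):
    fx, fy, cx, cy = K
    return ((u - cx) / fx * d, (v - cy) / fy * d, d)


def _project(p, K):
    fx, fy, cx, cy = K
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


def is_consistent(u, v, depth_ref, depth_src, K, ref_to_src, src_to_ref,
                  max_px=1.0, max_rel_depth=0.01):
    """Forward-backward check of one reference pixel against one source view."""
    d_ref = depth_ref[v][u]
    if d_ref <= 0.0:
        return False
    # Reference pixel -> 3D -> source image.
    p_src = _transform(ref_to_src, _lift(u, v, d_ref, K))
    if p_src[2] <= 0.0:
        return False
    us, vs = _project(p_src, K)
    ui, vi = int(round(us)), int(round(vs))  # nearest-neighbor lookup
    if not (0 <= vi < len(depth_src) and 0 <= ui < len(depth_src[0])):
        return False
    d_src = depth_src[vi][ui]
    if d_src <= 0.0:
        return False
    # Source pixel with its own depth -> 3D -> back into the reference image.
    p_back = _transform(src_to_ref, _lift(ui, vi, d_src, K))
    if p_back[2] <= 0.0:
        return False
    ub, vb = _project(p_back, K)
    px_err = math.hypot(ub - u, vb - v)
    rel_err = abs(p_back[2] - d_ref) / d_ref
    return px_err <= max_px and rel_err <= max_rel_depth


def consistency_mask(depth_ref, sources, K, min_views=1):
    """Keep pixels that agree with at least min_views source views.

    sources: list of (depth_src, ref_to_src, src_to_ref) tuples.
    """
    mask = []
    for v in range(len(depth_ref)):
        row = []
        for u in range(len(depth_ref[0])):
            votes = sum(
                is_consistent(u, v, depth_ref, d_src, K, fwd, bwd)
                for d_src, fwd, bwd in sources
            )
            row.append(votes >= min_views)
        mask.append(row)
    return mask
```

A pixel survives only if its depth, seen from a neighboring keyframe and mapped back, lands near the original pixel with a similar depth. Raising `min_views` trades completeness for fewer outliers.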