MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization

Pengcheng Zhu, Yaoming Zhuang, Baoquan Chen, Li Li, Chengdong Wu, Zhanlin Liu

TL;DR

A novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. It achieves state-of-the-art pose estimation accuracy and outperforms previous monocular methods in novel view synthesis and geometric reconstruction fidelity.

Abstract

This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, SLAM based on Gaussian Splatting has shown promising results. However, in monocular scenarios, the reconstructed Gaussian maps lack geometric accuracy and the systems exhibit weaker tracking capability. To address these limitations, we jointly optimize sparse visual odometry tracking and 3D Gaussian Splatting scene representation for the first time. We obtain depth maps on visual odometry keyframe windows using a fast Multi-View Stereo (MVS) network for the geometric supervision of Gaussian maps. Furthermore, we propose a depth smooth loss and a Sparse-Dense Adjustment Ring (SDAR) to reduce the negative effect of errors in the estimated depth maps and to preserve scale consistency between the visual odometry and the Gaussian maps. We have evaluated our system on various synthetic and real-world datasets. The accuracy of our pose estimation surpasses that of existing methods and achieves state-of-the-art performance. Additionally, the system outperforms previous monocular methods in novel view synthesis and geometric reconstruction fidelity.
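To make the scale-consistency idea concrete, here is a minimal sketch of aligning a dense depth map to sparse visual odometry depths. It is an illustration under assumptions, not the SDAR procedure itself, which the abstract does not specify: the robust median-ratio estimator and the names `estimate_scale` and `rescale_depth_map` are hypothetical choices made here.

```python
from statistics import median


def estimate_scale(sparse_depths, dense_depths, min_depth=1e-6):
    """Estimate one scale factor that maps dense depths onto sparse depths.

    sparse_depths: depths of sparse map points seen in a keyframe.
    dense_depths: dense depth values sampled at the same pixels.
    The median of per-point ratios is an assumed, outlier-tolerant estimator.
    """
    ratios = [
        s / d
        for s, d in zip(sparse_depths, dense_depths)
        if s > min_depth and d > min_depth  # skip invalid or missing depths
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to estimate scale from")
    return median(ratios)


def rescale_depth_map(depth_map, scale):
    """Return a copy of a 2D depth map (list of rows) multiplied by scale."""
    return [[scale * d for d in row] for row in depth_map]


# Sparse depths are twice the dense depths, except for one outlier pair.
scale = estimate_scale([2.0, 4.0, 60.0], [1.0, 2.0, 3.0])
aligned = rescale_depth_map([[1.0, 2.0], [3.0, 4.0]], scale)
```

The median is used here only because it tolerates a few bad correspondences; a least-squares fit or a joint optimization would be equally plausible stand-ins.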

Paper Structure

This paper contains 14 sections, 19 equations, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Map reconstruction process by the proposed system. The prior depth map is estimated from the keyframes of sparse visual odometry and optimized by a sparse point cloud map, and the optimized depth map is used to construct a dense Gaussian map.
  • Figure 2: System pipeline. The system takes an RGB stream as input and runs frontend and backend processes in parallel. In the frontend, sparse visual odometry extracts patch features from images to estimate poses. These estimated poses and images are fed to a pre-trained Multi-View Stereo (MVS) network, which estimates priori depth maps. In the backend, the estimated priori depth maps and images, together with poses from the frontend, are used as supervision to construct a Gaussian map. The frontend and backend maintain scale consistency through the SDAR strategy.
  • Figure 3: The fast Multi-View Stereo network. The network takes images with poses from sparse visual odometry as input. Image features are extracted by a Feature Pyramid Network (FPN) and warped into a 2D cost volume, which is then encoded and decoded into depth maps using a coarse-to-fine strategy.
  • Figure 4: Depth smooth regularization loss. A comparison of results with and without the depth smooth loss: photometry and geometry are better with the loss and worse without it.
  • Figure 5: Priori depth optimization. This optimization strategy in the SDAR corrects the geometry of the priori depth map from the MVS network and aligns its scale with the sparse point cloud map.
  • ...and 2 more figures
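
The depth smooth regularization named in Figure 4 can be illustrated with a minimal sketch. The captions above do not give the paper's formula, so this uses the widely used edge-aware first-order smoothness term as an assumed stand-in. The function name `edge_aware_depth_smoothness` and the weight `alpha` are hypothetical.

```python
import math


def edge_aware_depth_smoothness(depth, gray, alpha=1.0):
    """Mean edge-aware first-order smoothness penalty of a depth map.

    depth, gray: equally sized 2D lists (rows of floats).
    alpha: assumed weight for how strongly image edges relax the penalty.
    """
    height, width = len(depth), len(depth[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            # Compare each pixel with its right and bottom neighbours.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny >= height or nx >= width:
                    continue
                depth_grad = abs(depth[ny][nx] - depth[y][x])
                image_grad = abs(gray[ny][nx] - gray[y][x])
                # Large image gradients (likely object edges) shrink the penalty.
                total += depth_grad * math.exp(-alpha * image_grad)
                count += 1
    return total / count if count else 0.0


flat_image = [[0.0, 0.0], [0.0, 0.0]]  # no texture: depth steps fully penalized
edge_image = [[0.0, 1.0], [0.0, 1.0]]  # image edge aligned with the depth step
step_depth = [[1.0, 2.0], [1.0, 2.0]]

loss_flat = edge_aware_depth_smoothness(step_depth, flat_image)
loss_edge = edge_aware_depth_smoothness(step_depth, edge_image)
```

In this toy case the same depth step costs less when it coincides with an image edge, which is the usual motivation for weighting depth gradients by image gradients: depth noise in flat regions is suppressed while true discontinuities are kept.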