Visual SLAM with 3D Gaussian Primitives and Depth Priors Enabling Novel View Synthesis

Zhongche Qu; Zhi Zhang; Cong Liu; Jianhua Yin

TL;DR

This work addresses the challenge of achieving dense, real-time SLAM by integrating 3D Gaussian Splatting (3DGS) with depth priors and differentiable rendering. A rotation-translation decoupled inverse-optimization pipeline estimates camera pose while updating a dense 3D Gaussian map, enabling novel view synthesis through fast CUDA rasterization. Depth priors regularize the 3DGS representation to mitigate multi-view inconsistencies, yielding improved pose accuracy and depth reconstruction. Evaluations on the TUM-RGBD dataset demonstrate centimeter-level localization, competitive view synthesis (PSNR $\approx$ 19–25 dB), and low depth RMSE ($\approx$ 2.8–6 cm), validating real-time performance and dense reconstruction capabilities. The approach advances dense SLAM by combining explicit 3D primitives, differentiable rendering, and regularization via depth information, with implications for AR/VR and robotics.
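
The two reported metrics can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: the function names `psnr` and `depth_rmse` are hypothetical, and the sketch assumes images scaled to [0, 1], depth in meters, and a sensor depth of zero marking missing measurements.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def depth_rmse(rendered_depth, sensor_depth):
    """RMSE over pixels with valid sensor depth, in the input's units.

    Assumption: a sensor depth of 0 means "no measurement" and is skipped.
    """
    valid = sensor_depth > 0
    diff = rendered_depth[valid] - sensor_depth[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these assumptions, a uniform per-pixel error of 0.1 on a [0, 1] image gives 20 dB, which sits inside the 19–25 dB range quoted above.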

Abstract

Conventional geometry-based SLAM systems lack dense 3D reconstruction capabilities because their data association usually relies on feature correspondences, while learning-based SLAM systems often fall short in real-time performance and accuracy. Balancing real-time performance with dense 3D reconstruction remains a challenging problem. In this paper, we propose a real-time RGB-D SLAM system that uses a novel view synthesis technique, 3D Gaussian Splatting (3DGS), for 3D scene representation and pose estimation. The system leverages the real-time rasterization-based rendering of 3DGS and performs differentiable optimization in real time through a CUDA implementation. We also enable mesh reconstruction from 3D Gaussians for explicit dense 3D reconstruction. To estimate accurate camera poses, we use a rotation-translation decoupled strategy with inverse optimization, in which rotation and translation are updated over several iterations of gradient-based optimization. Given the existing 3D Gaussian map, each iteration differentiably renders RGB, depth, and silhouette maps and updates the camera parameters to minimize a combined photometric, depth geometry, and visibility loss. However, 3DGS struggles to represent surfaces accurately because of the multi-view inconsistency of 3D Gaussians, which can reduce accuracy in both camera pose estimation and scene reconstruction. To address this, we use depth priors as additional regularization to enforce geometric constraints, thereby improving both pose estimation and 3D reconstruction. We provide extensive experimental results on public benchmark datasets to demonstrate the effectiveness of the proposed method in terms of pose accuracy, geometric accuracy, and rendering performance.
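
The combined loss described in the abstract can be sketched as a weighted sum of three terms. The abstract does not give the exact form of each term, so everything specific here is an assumption made for illustration: the L1 photometric and depth terms, the visibility term (mean uncovered silhouette), the silhouette threshold, the weights, and the function name `tracking_loss`. NumPy is used only to show the arithmetic; the paper's system evaluates these terms through a differentiable CUDA rasterizer.

```python
import numpy as np

def tracking_loss(rgb_r, rgb_o, depth_r, depth_o, sil_r,
                  w_photo=0.5, w_depth=1.0, w_vis=0.1, sil_thresh=0.99):
    """Weighted photometric + depth + visibility loss (illustrative only).

    rgb_r, rgb_o     : rendered / observed colour, shape (H, W, 3), in [0, 1]
    depth_r, depth_o : rendered / observed depth, shape (H, W)
    sil_r            : rendered silhouette (map coverage), shape (H, W), in [0, 1]
    All weights and the threshold are hypothetical values.
    """
    # Compare only pixels well covered by the existing map and with valid depth.
    mask = (sil_r > sil_thresh) & (depth_o > 0)
    n = max(int(mask.sum()), 1)  # avoid division by zero on an empty mask

    photo = np.abs(rgb_r - rgb_o).sum(axis=-1)[mask].sum() / n  # L1 colour error
    depth = np.abs(depth_r - depth_o)[mask].sum() / n           # L1 depth error
    vis = float(np.mean(1.0 - sil_r))  # assumed form: fraction of uncovered view

    return float(w_photo * photo + w_depth * depth + w_vis * vis)
```

In a decoupled scheme such as the one the abstract describes, a scalar of this kind would be minimized with respect to the camera rotation and translation parameters while the Gaussian map is held fixed.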

Paper Structure (13 sections, 6 equations, 7 figures, 2 tables)

INTRODUCTION
PRELIMINARIES
Visual Geometry Residual
Neural Radiance Fields
3D Gaussian Splatting
PROPOSED APPROACH
Initialization
Pose Estimation
Scene Optimization
EVALUATION
Experiment Setup
Experiment Results
CONCLUSIONS

Figures (7)

Figure 1: Our final reconstruction results on the Freiburg3 long office sequence.
Figure 2: Overview of NeRF-based SLAM pipeline. Figure taken from ming2024benchmarking.
Figure 3: Pipeline of NeRF. Figure taken from nerf-c.
Figure 4: Qualitative rendering results on freiburg2. The top row, from left to right, shows the input RGB, input depth map, and rasterized silhouette. The bottom row, from left to right, displays the rendered RGB, rendered depth map, and visualization of the L1 loss on the depth map.
Figure 5: Qualitative rendering results on freiburg3 office. The top row, from left to right, shows the input RGB, input depth map, and rasterized silhouette. The bottom row, from left to right, displays the rendered RGB, rendered depth map, and visualization of the L1 loss on the depth map.
...and 2 more figures
