IG-SLAM: Instant Gaussian SLAM

F. Aykut Sarikamis, A. Aydin Alatan

TL;DR

IG-SLAM tackles real-time dense RGB-only SLAM by uniting robust Dense-SLAM tracking with 3D Gaussian Splatting. It leverages depth uncertainty in map optimization and a decay-based training strategy to run at about 10 fps in a single process while maintaining high-quality reconstructions. A DROID-SLAM-based tracking backbone yields poses $\textbf{G}_t$ and dense depths $\textbf{d}_t$ with covariance $\boldsymbol{\nabla}$, which supervise a differentiable Gaussian Splatting mapper trained in a coarse-to-fine regime. Evaluations on Replica, TUM-RGBD, ScanNet, and EuRoC show competitive rendering and 3D reconstruction quality with notable speed gains, including best performance on EuRoC, demonstrating the practicality of depth-aware Gaussian SLAM for large-scale sequences.

Abstract

3D Gaussian Splatting has recently shown promising results as an alternative scene representation in SLAM systems to neural implicit representations. However, current methods either lack dense depth maps to supervise the mapping process or detailed training designs that consider the scale of the environment. To address these drawbacks, we present IG-SLAM, a dense RGB-only SLAM system that employs robust Dense-SLAM methods for tracking and combines them with Gaussian Splatting. A 3D map of the environment is constructed using accurate pose and dense depth provided by tracking. Additionally, we utilize depth uncertainty in map optimization to improve 3D reconstruction. Our decay strategy in map optimization enhances convergence and allows the system to run at 10 fps in a single process. We demonstrate competitive performance with state-of-the-art RGB-only SLAM systems while achieving faster operation speeds. We present our experiments on the Replica, TUM-RGBD, ScanNet, and EuRoC datasets. The system achieves photo-realistic 3D reconstruction in large-scale sequences, particularly in the EuRoC dataset.
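
As a rough illustration of how depth uncertainty can enter map optimization, the sketch below down-weights the depth residual at pixels where the tracker reports high variance. The inverse-variance weighting, the L1 residual, the weight normalization, and the names `weighted_depth_loss` and `eps` are assumptions made for this example; the exact loss used by IG-SLAM is not given in this summary.

```python
import numpy as np

def weighted_depth_loss(rendered_depth, tracked_depth, depth_variance, eps=1e-6):
    """Absolute depth error, averaged with inverse-variance weights (assumed form)."""
    # Pixels with high variance (uncertain depth) receive small weights.
    weights = 1.0 / (depth_variance + eps)
    # Normalize so the weights sum to one and the loss stays a weighted mean.
    weights = weights / weights.sum()
    residual = np.abs(rendered_depth - tracked_depth)
    return float(np.sum(weights * residual))

# Toy 2x2 depth maps: only the bottom-right pixel disagrees (by 2 units).
rendered = np.array([[1.0, 2.0], [3.0, 4.0]])
tracked = np.array([[1.0, 2.0], [3.0, 6.0]])

uniform_var = np.ones((2, 2))
uncertain_var = np.array([[1.0, 1.0], [1.0, 100.0]])  # bad pixel is uncertain

loss_uniform = weighted_depth_loss(rendered, tracked, uniform_var)
loss_uncertain = weighted_depth_loss(rendered, tracked, uncertain_var)
print(loss_uniform, loss_uncertain)
```

With uniform variance every pixel counts equally; marking the disagreeing pixel as uncertain shrinks its contribution, so unreliable depth estimates pull less on the map.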

Paper Structure

This paper contains 21 sections, 8 equations, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: Qualitative rendering results from Photo-SLAM \cite{photo-slam} and IG-SLAM. We compare the visual quality of the methods on the large-scale EuRoC dataset \cite{euroc}.
  • Figure 2: System Overview. Our system takes an RGB image stream as input and outputs the camera pose and a scene representation in the form of a set of Gaussians. We decouple this objective into two parts: tracking and mapping. Tracking: Keyframes are created and added to the frame graph based on average optical flow. A pretrained GRU refines the optical flow between keyframes. Dense bundle adjustment (DBA) is performed on the frame graph, minimizing reprojection error while optimizing the dense depth map and camera pose and simultaneously calculating the depth map covariance. After several iterations, depth maps and camera poses are expected to converge. Mapping: The keyframes' pose, depth, and covariance obtained from tracking are used for 3D reconstruction. We initialize Gaussians from low-covariance regions using the camera pose and depth map. The 3D Gaussians are then projected onto the image plane and rendered with a differentiable tile rasterizer. The loss function is a combination of depth and color losses, with the depth loss weighted by covariance. Finally, the loss is backpropagated to optimize the Gaussians' orientation, scaling, opacity, position, and color, as indicated by the orange arrows in the figure. Moreover, Gaussians are split, cloned, and pruned based on the local gradients.
  • Figure 3: An example of normalized covariance (left) and the corresponding mask (right). The mask is created by thresholding the normalized covariance with a maximum filter and smoothing with a majority filter. The white region of the mask is left out and not used during Gaussian initialization.
  • Figure 4: Three hypothetical cases that can be encountered in training. Dashed lines pass from the camera center through the ground-truth Gaussian positions. The faded Gaussians represent their previous positions. Red lines are the position update steps along the gradient direction. In A), a large position update changes the order of the Gaussians, creating TV-static-like noise in training. In B), multiple iterations are needed to move the Gaussians to the correct place because the position updates are small. C) represents the ideal case, where the position update exactly equals the position error.
  • Figure 5: Qualitative results for the learning rate decay ablation study. The four cases studied in Table \ref{tab:office0_decay} are shown in the figure: constant learning rates of $1.6 \times 10^{-4}$ (top-left), $5.0 \times 10^{-5}$ (top-right), and $1.6 \times 10^{-6}$ (bottom-left), and the decaying $1.6 \times 10^{-4}$ learning rate (bottom-right) as reference.
  • ...and 1 more figure
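
The learning rate trade-off described in the captions of Figures 4 and 5 can be sketched with a simple decay schedule: large position steps early, small steps later. The snippet below is only a minimal sketch under assumptions. It uses log-linear (exponential) interpolation, and it takes the endpoints $1.6 \times 10^{-4}$ and $1.6 \times 10^{-6}$ from the constant rates listed for Figure 5. The function name `decayed_lr`, the parameter `max_steps`, and the shape of the schedule are hypothetical and not confirmed by this summary.

```python
import math

def decayed_lr(step, max_steps, lr_init=1.6e-4, lr_final=1.6e-6):
    """Interpolate the learning rate log-linearly from lr_init to lr_final."""
    # Fraction of training completed, clamped to [0, 1].
    t = min(max(step / max_steps, 0.0), 1.0)
    # Linear interpolation in log space gives exponential decay.
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))

# Print the schedule at a few points of a hypothetical 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, decayed_lr(step, 100))
```

Under this schedule the rate halfway through is the geometric mean of the two endpoints (about $1.6 \times 10^{-5}$), so early iterations can take the larger steps of case B's remedy while later iterations avoid the overshoot of case A.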