GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry

Jiung Yeon, Seongbo Ha, Hyeonwoo Yu

TL;DR

GSO-SLAM tackles real-time monocular dense SLAM by bidirectionally coupling Visual Odometry and Gaussian Splatting through an EM framework, enabling joint refinement of camera poses, semi-dense depth, and the Gaussian scene without extra computational cost. A novel Gaussian Splat Initialization leverages VO outputs, image gradients, and multi-keyframe covariances to quickly initialize Gaussians close to their final configuration, accelerating convergence and improving fidelity. The method is evaluated across synthetic and real datasets, demonstrating superior geometric and photometric reconstruction quality and robust tracking, with real-time performance and favorable scalability. This work offers a practical pathway to high-fidelity dense mapping on monocular systems, reducing computational overhead while enhancing both tracking robustness and map quality in diverse environments.

Abstract

We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.

Paper Structure

This paper contains 17 sections, 12 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of Reconstructed Scene and Rendering Results from Different SLAM Systems. The top two images show the reconstructed 3D dense scene, highlighting the rapid reconstruction of our method. The bottom images present depth and RGB renderings, demonstrating our system's geometric accuracy and photometric fidelity.
  • Figure 2: SLAM System Overview. Our system reconstructs a 3D scene from monocular video. After tracking and keyframe selection, new Gaussian splats $\mathcal{G}_{new}$ are initialized, followed by EM-based joint optimization of both camera poses and the dense scene.
  • Figure 3: Gaussian Splat Initialization. Our initialization comprises 3 steps: (1) estimating 2D covariances from keyframe image intensities and gradients, (2) combining them to compute the 3D covariance, and (3) applying eigen-decomposition to extract the rotation and scaling parameters for the initial Gaussian.
  • Figure 4: Comparison of Rendering Results on Replica Dataset. Our method exhibits superior geometric accuracy and high photometric fidelity compared to other methods.
  • Figure 5: Comparison of Rendering Results on Self-captured Real-world Dataset collected by a Quadrupedal Robot. The RGB images are rendered from the scenes reconstructed by RGB-based approaches. Our method exhibits better photometric fidelity and sharper details than other methods.
  • ...and 1 more figures
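
Step (3) of the Gaussian Splat Initialization in Figure 3, extracting rotation and scaling parameters from a 3D covariance by eigen-decomposition, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `eps` floor, the handedness fix, and the example matrix are assumptions made for illustration, and the sketch assumes the usual Gaussian Splatting factorization $\Sigma = R S S^\top R^\top$ with $S$ diagonal. Steps (1) and (2), which build the 3D covariance from keyframe intensities and gradients, are not covered here.

```python
import numpy as np


def covariance_to_rotation_scale(cov, eps=1e-8):
    """Split a symmetric 3x3 covariance into a rotation matrix and per-axis scales.

    Hypothetical helper: assumes cov = R @ diag(s**2) @ R.T.
    """
    cov = np.asarray(cov, dtype=float)
    # Symmetrize to suppress small numerical asymmetries.
    cov = 0.5 * (cov + cov.T)
    # eigh returns ascending eigenvalues and orthonormal eigenvectors (as columns).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Floor the eigenvalues so the scales stay strictly positive (assumed safeguard).
    eigvals = np.clip(eigvals, eps, None)
    scales = np.sqrt(eigvals)
    rotation = eigvecs.copy()
    # Flip one axis if needed so the result is a proper rotation (det = +1).
    if np.linalg.det(rotation) < 0:
        rotation[:, 0] *= -1.0
    return rotation, scales


def rotation_scale_to_covariance(rotation, scales):
    """Rebuild the covariance from rotation and scales: R S S^T R^T."""
    S = np.diag(scales)
    return rotation @ S @ S.T @ rotation.T


# Illustrative symmetric positive-definite covariance (made-up values).
cov_example = np.array([
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 1.0],
])

rotation_example, scales_example = covariance_to_rotation_scale(cov_example)
reconstructed_example = rotation_scale_to_covariance(rotation_example, scales_example)
print("scales:", scales_example)
print("max reconstruction error:", np.abs(reconstructed_example - cov_example).max())
```

Flipping the sign of one eigenvector leaves the reconstructed covariance unchanged, so the handedness fix only matters when the rotation is later converted to a quaternion, as Gaussian Splatting pipelines commonly do.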