VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM

Zihan Zhu, Wei Zhang, Norbert Haala, Marc Pollefeys, Daniel Barath

TL;DR

VIGS-SLAM targets robust dense SLAM by tightly integrating high-frequency IMU data with a learning-enhanced Gaussian Splatting representation. The approach couples visual residuals with inertial terms, uses staged IMU initialization, and handles loop closures via a Sim(3) pose graph, followed by efficient Gaussian map updates. Experiments spanning indoor and outdoor scenes, handheld and drone platforms, and multiple datasets show state-of-the-art tracking accuracy and high-quality novel-view rendering, with resilience to motion blur, low texture, and dropped frames. The work advances dense VI-SLAM by enabling real-time photorealistic mapping with robust loop closure and online parameter adaptation, which makes it well suited to AR/VR and robotics applications.

Abstract

We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods achieve dense and photorealistic mapping, their purely visual design degrades under motion blur, low texture, and exposure variations. Our method tightly couples visual and inertial cues within a unified optimization framework, jointly refining camera poses, depths, and IMU states. It features robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates. Experiments on four challenging datasets demonstrate our superiority over state-of-the-art methods. Project page: https://vigs-slam.github.io

Paper Structure

This paper contains 49 sections, 8 equations, 5 figures, 20 tables.

Figures (5)

  • Figure 1: VIGS-SLAM. Given a sequence of RGB frames and IMU readings, our method robustly tracks the camera trajectory while reconstructing a 3D Gaussian map. The visualization shows the Retail sequence of the FAST-LIVO2 [zheng2024fast] dataset.
  • Figure 2: System Overview. VIGS-SLAM takes as input a sequence of RGB frames and IMU readings, and simultaneously estimates camera poses while building a 3D Gaussian map $\mathcal{G}$. Keyframes are selected based on optical flow, and each new keyframe is initialized using the IMU pre-integration from the previous keyframe to the current one. This keyframe is then added to the local frame graph, where visual-inertial bundle adjustment jointly optimizes camera poses, depths, and IMU parameters. Visual correspondences are iteratively refined using a recurrent ConvGRU module. In parallel, a global pose graph is maintained using relative pose constraints from the frontend tracking. For Gaussian mapping, the depth of each new keyframe is unprojected into 3D using the estimated pose, converted into initial Gaussians, and fused into the global map. Both color and depth re-rendering losses are used to refine the Gaussians. Loop closure detection is performed based on optical flow differences between the new keyframe and all previous ones. When a loop is detected, pose graph bundle adjustment is performed, followed by an efficient Gaussian update to maintain global consistency.
  • Figure 3: Novel View Synthesis Results across Datasets. Sequences are sampled from the RPNG [Chen2023rpng] (table_01, table_06), UTMM [sun2024mm3dgs] (EgoDrv, Sq-2), and FAST-LIVO2 [zheng2024fast] (CBD2, HKU) datasets.
  • Figure 4: Average Tracking Performance on Strided Datasets. We plot mean recall at 5 cm and 10 cm thresholds under different stride settings. All baseline results are obtained from the authors' official code, using dataset-specific configurations when available.
  • Figure 5: Tracking under Extreme Conditions. In a challenging sequence containing textureless regions, exposure variations, and motion blur, our VIGS-SLAM maintains stable tracking, whereas HI-SLAM2 drifts and ORB-SLAM3 succeeds only on a short segment (ATE RMSE $\downarrow$ [cm]: VIGS-SLAM 6.97, HI-SLAM2 98.55, ORB-SLAM3 25.59).
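
The Figure 2 caption states that each new keyframe's depth is unprojected into 3D using the estimated pose before being converted into initial Gaussians. The following is a minimal sketch of that unprojection step only, under assumptions the caption does not spell out: a pinhole camera model, depth measured along the camera z-axis, and a camera-to-world pose matrix. The function name `unproject_depth` and its parameters are hypothetical and are not taken from the VIGS-SLAM code.

```python
import numpy as np

def unproject_depth(depth, K, T_wc):
    """Lift a depth map to world-frame 3D points (illustrative sketch).

    depth : (H, W) depth along the camera z-axis (assumed convention).
    K     : (3, 3) pinhole intrinsics (assumed camera model).
    T_wc  : (4, 4) camera-to-world pose (assumed convention).
    Returns an (H*W, 3) array of points in row-major pixel order.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Back-project each pixel along its ray, scaled by depth.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Rigid transform from the camera frame to the world frame.
    return pts_cam @ T_wc[:3, :3].T + T_wc[:3, 3]

# Toy example: 3x3 depth map at 2 m, camera shifted 1 m along world x.
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
T_wc = np.eye(4)
T_wc[0, 3] = 1.0
points = unproject_depth(np.full((3, 3), 2.0), K, T_wc)
```

In a Gaussian mapping pipeline, points like these would typically seed the Gaussian centers; how VIGS-SLAM initializes the remaining Gaussian attributes is not described in the caption and is not sketched here.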
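
Figures 4 and 5 report tracking quality as recall at 5 cm and 10 cm thresholds and as ATE RMSE. The sketch below shows one common way such numbers can be computed; it is not the paper's evaluation code. It assumes the estimated and ground-truth trajectories are already time-associated and aligned to a common frame (alignment is omitted), and it assumes recall means the fraction of poses whose position error falls below the threshold, a definition the captions do not state. All function names are hypothetical.

```python
import numpy as np

def position_errors(est, gt):
    """Per-pose Euclidean position error for (N, 3) trajectories."""
    return np.linalg.norm(est - gt, axis=1)

def ate_rmse(est, gt):
    """Root-mean-square of the per-pose position errors.

    Assumes est and gt are already associated and aligned.
    """
    e = position_errors(est, gt)
    return float(np.sqrt(np.mean(e ** 2)))

def recall_at(est, gt, threshold):
    """Fraction of poses with position error below threshold (assumed definition)."""
    return float(np.mean(position_errors(est, gt) < threshold))

# Toy trajectories in meters: errors of 3 cm, 4 cm, 8 cm and 20 cm.
gt = np.zeros((4, 3))
est = np.zeros((4, 3))
est[:, 0] = [0.03, 0.04, 0.08, 0.20]

rmse_m = ate_rmse(est, gt)
recall_5cm = recall_at(est, gt, 0.05)
recall_10cm = recall_at(est, gt, 0.10)
```

RMSE weights large errors heavily, so a single drifting segment can dominate it, while recall at a fixed threshold shows how much of the trajectory stays within tolerance. Reporting both gives complementary views of tracking quality.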