MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements

Lisong C. Sun; Neel P. Bhatt; Jonathan C. Liu; Zhiwen Fan; Zhangyang Wang; Todd E. Humphreys; Ufuk Topcu

TL;DR

MM3DGS SLAM introduces a multi-modal, real-time SLAM system that represents scenes with a dense set of 3D Gaussians and fuses RGB/RGB-D imagery with IMU data. The approach integrates differentiable 3D Gaussian splatting, depth supervision, and inertial fusion across a keyframe-based pipeline (tracking, keyframe selection, Gaussian initialization, mapping) to improve trajectory accuracy and photorealistic rendering. It demonstrates a $3\times$ improvement in tracking and a $5\%$ gain in PSNR over the state-of-the-art RGB-D Gaussian Splatting SLAM baseline on UT-MM, while delivering real-time rendering at 90 fps. A new UT-MM dataset is released to support multi-modal SLAM research, with suggestions for future work including tightly-coupled IMU fusion and loop-closure strategies.

Abstract

Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage: https://vita-group.github.io/MM3DGS-SLAM
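The abstract describes tracking losses that combine photometric rendering quality, depth estimates, and relative pose from pre-integrated inertial measurements. The sketch below shows one plausible way such terms could be summed into a single scalar. It is an illustration under stated assumptions, not the paper's formulation: the function names, the L1 and squared-distance residuals, the translation-only IMU term, and the weights `w_photo`, `w_depth`, and `w_imu` are all hypothetical.

```python
from typing import Sequence


def mean_abs_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean L1 difference between two equally sized, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must be the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tracking_loss(
    rendered_rgb: Sequence[float],
    observed_rgb: Sequence[float],
    rendered_depth: Sequence[float],
    observed_depth: Sequence[float],
    est_rel_translation: Sequence[float],
    imu_rel_translation: Sequence[float],
    w_photo: float = 1.0,   # hypothetical weight
    w_depth: float = 0.5,   # hypothetical weight
    w_imu: float = 0.1,     # hypothetical weight
) -> float:
    """Weighted sum of photometric, depth, and inertial residuals.

    Assumption: images and depth maps are flattened to 1-D sequences, and
    the inertial term compares only the relative translation between two
    frames (rotation is omitted to keep the sketch short).
    """
    photo = mean_abs_error(rendered_rgb, observed_rgb)
    depth = mean_abs_error(rendered_depth, observed_depth)
    imu = squared_distance(est_rel_translation, imu_rel_translation)
    return w_photo * photo + w_depth * depth + w_imu * imu
```

In a real system each term would be computed on tensors from a differentiable renderer so that gradients can flow back to the camera pose; plain Python floats are used here only to keep the example self-contained.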

Paper Structure (22 sections, 13 equations, 8 figures, 3 tables)

INTRODUCTION
RELATED WORKS
SLAM Map Representations
Efficient 3D Representation
Multi-modal SLAM Frameworks
METHOD
3D Gaussian Splatting
Tracking
Depth Supervision
Inertial Fusion
Gaussian Initialization
Keyframe Selection
Mapping
EXPERIMENTAL SETUP
Datasets
...and 7 more sections

Figures (8)

Figure 1: Overview of the MM3DGS framework. We receive camera images and inertial measurements from a mobile robot. We utilize depth measurements and IMU pre-integration for pose optimization using a combined tracking loss. We apply a keyframe selection approach based on image covisibility and the NIQE metric [mittal2013niqe] across a sliding window and initialize new 3D Gaussians for keyframes with low opacity and high depth error. Finally, we optimize parameters of the 3D Gaussians according to the mapping loss for the selected keyframes.
Figure 2: Our dataset provides RGB images (top left), depth images (top right), IMU measurements (bottom left), and LiDAR point clouds (bottom right). The above examples were taken from the Ego-drive scene.
Figure 3: A depiction of the mobile robot platform (left), equipped with an RGB-D camera, an IMU, and a LiDAR, and the test environment (right), featuring a 16-camera Vicon-based ground truth system.
Figure 4: Qualitative results on UT-MM dataset: RGB and depth renderings of UT-MM scenes. Note that the ground truth (GT) depths are captured with depth cameras, and thus are imperfect. Our method exhibits geometric details not present in the GT depth, as well as fewer RGB artifacts compared to SplaTAM.
Figure 5: Tracking results for the UT-MM Square-1 scene. The blue solid line denotes the tracked trajectory, while the red dotted line denotes the ground truth. Top: monocular RGB case exhibits substantial drift. Middle: RGB-D case fixes Z drift, but XY drift persists. Bottom: Adding IMU measurements to RGB-D fixes XY drift.
...and 3 more figures
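
The Figure 1 caption mentions keyframe selection based on image covisibility and the NIQE metric across a sliding window. One way this could work is sketched below. The exact rule is not given in this summary, so everything here is an assumption: covisibility is taken to be the intersection-over-union of the sets of Gaussian IDs visible in two frames, a frame becomes a candidate when its covisibility with the last keyframe drops below a hypothetical threshold, and the candidate with the lowest NIQE score (lower NIQE indicates better perceived image quality) is chosen. The names `covisibility`, `select_keyframe`, and `covis_threshold` are illustrative.

```python
from typing import Dict, List, Optional, Set


def covisibility(visible_a: Set[int], visible_b: Set[int]) -> float:
    """Intersection-over-union of the Gaussian IDs seen by two frames."""
    union = visible_a | visible_b
    if not union:
        return 1.0  # two empty views are treated as fully overlapping
    return len(visible_a & visible_b) / len(union)


def select_keyframe(
    window: List[Dict],
    last_keyframe_visible: Set[int],
    covis_threshold: float = 0.8,  # hypothetical threshold
) -> Optional[int]:
    """Return the id of the chosen frame in the window, or None.

    Each frame is a dict with keys "id" (int), "visible" (set of Gaussian
    IDs), and "niqe" (float, lower is better). A frame is a candidate if
    it overlaps the last keyframe by less than `covis_threshold`.
    """
    candidates = [
        frame
        for frame in window
        if covisibility(frame["visible"], last_keyframe_visible) < covis_threshold
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda frame: frame["niqe"])["id"]


# Usage: frame 0 overlaps the last keyframe completely, so only frames 1
# and 2 are candidates; frame 2 has the lower NIQE score of the two.
window = [
    {"id": 0, "visible": {1, 2, 3, 4}, "niqe": 4.0},
    {"id": 1, "visible": {3, 4, 5, 6}, "niqe": 6.0},
    {"id": 2, "visible": {4, 5, 6, 7}, "niqe": 5.0},
]
chosen = select_keyframe(window, last_keyframe_visible={1, 2, 3, 4})
```

Computing the NIQE scores themselves requires an image-quality model and is left outside the sketch; here they are assumed to be precomputed per frame.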
