VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes

Ke Wu, Zicheng Zhang, Muer Tie, Ziqing Ai, Zhongxue Gan, Wenchao Ding

TL;DR

VINGS-Mono introduces a monocular (inertial) Gaussian Splatting SLAM framework tailored for large-scale outdoor environments. It combines a Visual-Inertial Front End with a 2D Gaussian Map, a Novel View Synthesis–driven loop closure, and a Dynamic Object Eraser, supported by a Score Manager, Sample Rasterizer, and Single-to-Multi Pose Refinement to maintain global consistency while processing tens of millions of Gaussians in real time. The method achieves localization on par with Visual-Inertial Odometry and surpasses prior Gaussian/NeRF SLAM approaches in mapping and rendering quality, extending applicability to kilometer-scale urban scenes and mobile devices. Comprehensive indoor/outdoor experiments, ablations, and a real-world mobile app validate the approach and demonstrate robust performance under dynamic, large-scale conditions, highlighting practical potential for VR/AR and digital twins.

Abstract

VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.

Paper Structure (46 sections, 16 equations, 13 figures, 9 tables, 1 algorithm)

Figures (13)

  • Figure 1: VINGS-Mono's estimated trajectory and reconstructed Gaussian map of three different scenes. Our method effectively estimates poses and reconstructs high-quality Gaussian maps across large-scale driving scenarios, aerial drone views, and indoor environments. Particularly for the driving scene on the left, the trajectory spans 3.7 kilometers and includes a Gaussian map containing 32.5 million Gaussian ellipsoids. During training, we track the number of Gaussians and zoom in on specific areas to improve visualization clarity. (Project page: https://vings-mono.github.io)
  • Figure 2: Pipeline of VINGS-Mono. RGB and IMU readings are processed by the Visual-Inertial Front End to calculate pose and inverse depth. Based on this, the 2D GS Map is incrementally updated; the map module comprises a score manager, sample rasterization, and pose refinement. The NVS Loop Closure employs novel view synthesis for efficient loop detection and seamless correction. Furthermore, the Dynamic Object Eraser helps minimize the impact of moving objects on the framework.
  • Figure 3: Sample Rasterizer. In our backpropagation process, each thread is responsible for one Gaussian, and the number of iterations depends on the number of sampled pixels.
  • Figure 4: Pipeline of NVS Loop Closure. We perform feature matching, filtering, and novel view synthesis on keyframes that meet the distance threshold requirements to achieve loop detection. Once a loop is detected, we implement loop correction of the pose and Gaussian map through pairwise Gaussian with pose alignment and graph optimization.
  • Figure 5: Effect of Dynamic Object Eraser. Our dynamic eraser can filter out moving people indoors and fast-moving vehicles outdoors, preventing the Gaussian map from being affected by dynamic floaters.
  • ...and 8 more figures
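
The Figure 4 caption says loop detection starts from keyframes that meet a distance threshold. A minimal sketch of that first gating step might look like the following. The function name `loop_candidates` and the parameters `max_dist` and `min_index_gap` are hypothetical, and the thresholds the paper actually uses are not stated here. The later stages (feature matching, filtering, novel view synthesis, and loop correction) are not shown.

```python
import math

def loop_candidates(positions, max_dist, min_index_gap):
    """Return keyframe index pairs (i, j), i < j, that are spatially close
    but separated by at least `min_index_gap` keyframes.

    All names and thresholds are illustrative assumptions, not the paper's
    implementation.
    """
    pairs = []
    for j in range(len(positions)):
        # Only compare against keyframes that are old enough to count as a
        # revisit rather than a temporal neighbor; empty for early keyframes.
        for i in range(j - min_index_gap + 1):
            if math.dist(positions[i], positions[j]) <= max_dist:
                pairs.append((i, j))
    return pairs

# Toy trajectory: the last keyframe returns near the first one.
keyframe_positions = [(0, 0, 0), (5, 0, 0), (10, 0, 0), (5, 5, 0), (0.5, 0, 0)]
print(loop_candidates(keyframe_positions, max_dist=1.0, min_index_gap=3))  # [(0, 4)]
```

The index gap keeps consecutive keyframes, which are always close together, from being reported as loops. Each surviving pair would still need to be verified before it is accepted.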