A real-time, robust and versatile visual-SLAM framework based on deep learning networks

Zhang Xiao; Shuaixin Li

TL;DR

The paper addresses robust real-time visual SLAM in challenging environments and presents Rover-SLAM, a versatile hybrid vSLAM framework that integrates learning-based feature extraction (SuperPoint) and learning-based matching (LightGlue) across tracking, mapping, and loop closure. It supports monocular, monocular-inertial, stereo, and stereo-inertial configurations and uses adaptive feature filtering, learning-based local mapping, and a deep bag-of-words (BoW) loop-closure descriptor, all deployed via ONNX Runtime. Experiments on EuRoC, TUM-VI, and self-collected data show that Rover-SLAM achieves state-of-the-art localization accuracy and tracking robustness across configurations while maintaining real-time performance. The work provides a practical, scalable platform with a public code release that can benefit robotics, autonomous driving, and 3D reconstruction by improving SLAM reliability under challenging lighting, texture, and motion conditions.

Abstract

This paper explores how deep learning techniques can improve visual SLAM performance in challenging environments. By combining deep feature extraction and deep matching methods, we introduce a versatile hybrid visual SLAM system designed to enhance adaptability in challenging scenarios such as low-light conditions, dynamic lighting, weakly textured areas, and severe jitter. Our system supports multiple modes, including monocular, stereo, monocular-inertial, and stereo-inertial configurations. We also analyze how to combine visual SLAM with deep learning methods, to inform future research. Through extensive experiments on both public datasets and self-collected data, we demonstrate the superiority of the SL-SLAM system over traditional approaches. The experimental results show that SL-SLAM outperforms state-of-the-art SLAM algorithms in terms of localization accuracy and tracking robustness. For the benefit of the community, we make the source code public at https://github.com/zzzzxxxx111/SLslam.

Paper Structure (24 sections, 2 equations, 10 figures, 5 tables)

Introduction
Related Work
Traditional visual SLAM works
Deep learning-based visual SLAM works
Method
The System Overview
Adaptive Feature Extraction
Feature Matching and Front-end
Feature Matching network
Tracking combined with learning-based matching
Local mapping combined with learning-based method
Loop Closing
Deep bag-of-word descriptor
Loop correction
Experiment
...and 9 more sections

Figures (10)

Figure 1: Rover-SLAM vs. other SOTA methods. Top: Comparison of map point tracking performance between the proposed Rover-SLAM and ORB-SLAM3 in a challenging environment. Bottom-Left: Comparison of the trajectories obtained using ORB-SLAM3, VINS-Mono, and Rover-SLAM on the EuRoC V203 sequence, the most challenging sequence because of camera shake and dynamic lighting. Bottom-Right: Map details color-coded by the amount of error, i.e., green corresponds to higher error levels and blue to lower ones.
Figure 2: The framework of Rover-SLAM. The system consists of the tracking, local mapping, and loop closing modules.
Figure 3: Overview of feature extraction. SuperPoint serves as the feature extraction network and includes an encoder, a feature decoder, and a descriptor decoder. An adaptive filter is then applied to select feature points.
Figure 4: Comparison of matching performance in different stages. The first and second columns show the matching results of ORB-SLAM3 and Rover-SLAM, respectively. In the baseline system, coarse tracking uses a projection-based matching method, while monocular initialization and map point triangulation use a DBoWs-based matching method.
Figure 5: Map points comparison between Rover-SLAM and ORB-SLAM3. (a) shows a comparison of the reconstructed sparse point clouds, with rectangles of different colors highlighting corresponding regions with significant differences. (b) shows a comparison of tracked map points.
...and 5 more figures
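
The Figure 3 caption mentions an adaptive filter applied to the feature points, but this page does not describe how that filter works. The sketch below illustrates one plausible form only: a confidence threshold that is relaxed step by step when too few keypoints pass it, as can happen in low-light or weakly textured frames. The function name `adaptive_filter` and the parameters `base_threshold`, `min_keep`, `floor`, and `decay` are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Tuple

# A keypoint as (x, y, score); the score stands in for a detector
# confidence such as the one SuperPoint outputs (assumed layout).
Keypoint = Tuple[float, float, float]


def adaptive_filter(
    keypoints: List[Keypoint],
    base_threshold: float = 0.015,  # assumed starting confidence threshold
    min_keep: int = 100,            # assumed minimum number of points wanted
    floor: float = 0.0005,          # assumed lowest threshold allowed
    decay: float = 0.5,             # assumed relaxation factor per step
) -> Tuple[List[Keypoint], float]:
    """Keep keypoints above a threshold that relaxes until enough survive.

    Illustrative only; not the filter used by the authors.
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be in (0, 1) so the loop terminates")

    threshold = base_threshold
    while True:
        kept = [kp for kp in keypoints if kp[2] >= threshold]
        # Stop once enough points pass or the threshold cannot go lower.
        if len(kept) >= min_keep or threshold <= floor:
            break
        threshold = max(floor, threshold * decay)

    # Strongest responses first, so callers can truncate if needed.
    kept.sort(key=lambda kp: kp[2], reverse=True)
    return kept, threshold


if __name__ == "__main__":
    points = [(0.0, 0.0, 0.02), (1.0, 1.0, 0.01),
              (2.0, 2.0, 0.004), (3.0, 3.0, 0.0001)]
    kept, used = adaptive_filter(points, min_keep=3)
    print(len(kept), used)
```

In this sketch a well-lit, textured frame keeps the strict base threshold, while a difficult frame lowers it until `min_keep` points survive or `floor` is reached; a real system would likely also enforce spatial distribution of the points, which is omitted here.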
