Light-SLAM: A Robust Deep-Learning Visual SLAM System Based on LightGlue under Challenging Lighting Conditions

Zhiqi Zhao, Chang Wu, Xiaotong Kong, Zejie Lv, Xiaoqi Du, Qiyan Li

TL;DR

Light-SLAM presents a hybrid visual SLAM system that integrates LightGlue-based deep local descriptors with traditional geometry to improve robustness under challenging lighting. It replaces hand-crafted features with deep local features, employs an attention-based matching pipeline, and uses an optimized parallel image pyramid plus a stereo depth module to maintain real-time performance on GPU. Across KITTI, EuRoC, TUM, 4Season, and real campus datasets, Light-SLAM consistently outperforms traditional ORB-SLAM2 and several deep-learning–only baselines, especially in low-light and high-contrast conditions. The results indicate meaningful improvements in accuracy and robustness, with practical impact for autonomous systems operating under variable illumination.

Abstract

Simultaneous Localization and Mapping (SLAM) has become a critical technology for intelligent transportation systems and autonomous robots and is widely used in autonomous driving. However, traditional methods based on hand-crafted features struggle to remain robust and accurate in challenging lighting environments. Some deep learning-based methods show potential but still have significant drawbacks. To address this problem, we propose a novel hybrid visual SLAM system based on the LightGlue deep learning network. It replaces traditional hand-crafted features with deep local feature descriptors and uses a more efficient and accurate deep network for fast and precise feature matching, so that the robustness of deep learning improves the whole system. Combined with traditional geometry-based approaches, this yields a complete visual SLAM system for monocular, binocular, and RGB-D sensors. We thoroughly tested the proposed system on four public datasets (KITTI, EuRoC, TUM, and 4Season) as well as on real campus scenes. The experimental results show that the proposed method adapts to low-light and strongly light-varying environments with better accuracy and robustness than both traditional hand-crafted-feature methods and deep learning-based methods. It can also run in real time on a GPU.

Paper Structure (24 sections, 14 equations, 13 figures, 8 tables, 1 algorithm)

Figures (13)

  • Figure 1: The Light-SLAM system framework. For each input image pair, a SuperPoint network extracts local feature points and computes their descriptors, denoted $(d, p)$. A multi-layer network based on self-attention and cross-attention updates the feature states. The similarity and matchability between points are characterized from the attention scores, and the assignment between the predicted points is completed to obtain the optimal matching result. The matches are the input to the three threads that run in parallel in the system: tracking, local mapping, and loop closing.
  • Figure 2: The deep learning feature extractor architecture. It consists of a shared encoder and two decoders. Both decoders operate on a shared, spatially reduced representation of the input. One is trained to detect feature points, and the other to compute the corresponding descriptors.
  • Figure 3: Optimized parallel image pyramid model. The input image is progressively down-sampled with a scaling factor $\lambda$ to obtain images of different resolutions, and feature point extraction is done in parallel at each layer of the pyramid.
  • Figure 4: Multi-scale feature point matching results. The number of feature points extracted at each layer of the pyramid is tied to the area of that layer, which yields feature points covering a wider range of scales.
  • Figure 5: The framework of the stereo depth estimation module. The input stereo image pair is preprocessed, a cost function is computed, and the disparity estimate and depth image are obtained. The accurate LightGlue matches between the stereo images are then compared with the depth image to output the spatial points corresponding to the feature points.
  • ...and 8 more figures
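
The pyramid and depth steps described in the captions of Figures 3 to 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the scaling factor $\lambda$ is greater than 1 and divides the image size at each level, that the per-layer feature budget is split in direct proportion to layer area, and that depth follows the standard rectified-stereo relation $Z = f b / d$. The function names, the rounding, and the rule that gives leftover features to the finest layer are all hypothetical choices made for illustration.

```python
def pyramid_sizes(width, height, scale, levels):
    """Return (w, h) for each pyramid level.

    Assumption: `scale` (lambda) > 1 and level i is the input divided by scale**i.
    """
    sizes = []
    for i in range(levels):
        factor = scale ** i
        sizes.append((max(1, round(width / factor)),
                      max(1, round(height / factor))))
    return sizes


def area_quotas(sizes, total_features):
    """Split a feature budget across levels in proportion to each level's area.

    Integer division can leave a few features unassigned; this sketch gives
    the remainder to the finest level (an illustrative choice).
    """
    areas = [w * h for w, h in sizes]
    total_area = sum(areas)
    quotas = [total_features * a // total_area for a in areas]
    quotas[0] += total_features - sum(quotas)
    return quotas


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard rectified-stereo depth Z = f * b / d; None if disparity is invalid."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    sizes = pyramid_sizes(640, 480, scale=2.0, levels=3)
    print(sizes)
    print(area_quotas(sizes, total_features=1000))
    print(depth_from_disparity(35.0, focal_px=700.0, baseline_m=0.5))
```

Because each level is extracted independently once its quota is fixed, the per-level work can be dispatched in parallel, which is the property the Figure 3 caption highlights.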