LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM

Yucheng Huang, Luping Ji, Hudong Liu, Mao Ye

TL;DR

This work proposes Learnable Gaussian Uncertainty (LGU) matching: a learnable 2D Gaussian uncertainty model that associates matching-frame pairs, aimed mainly at precise correspondence construction.

Abstract

Deep visual Simultaneous Localization and Mapping (SLAM) techniques, e.g., DROID, have made significant advancements by leveraging deep visual odometry on dense flow fields. In general, they rely heavily on global visual similarity matching. However, ambiguous similarity interference in uncertain regions can often lead to excessive noise in correspondences, ultimately misleading SLAM in geometric modeling. To address this issue, we propose Learnable Gaussian Uncertainty (LGU) matching, which focuses mainly on precise correspondence construction. In our scheme, a learnable 2D Gaussian uncertainty model is designed to associate matching-frame pairs; it generates input-dependent Gaussian distributions for each correspondence map. Additionally, a multi-scale deformable correlation sampling strategy is devised to adaptively fine-tune the sampling in each direction by a priori look-up ranges, enabling reliable correlation construction. Furthermore, a KAN-bias GRU component is adopted to improve temporal iterative enhancement, accomplishing sophisticated spatio-temporal modeling with limited parameters. Extensive experiments on real-world and synthetic datasets are conducted to validate the effectiveness and superiority of our method.
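To make the idea of an input-dependent 2D Gaussian per correspondence map concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: the axis-aligned Gaussian form, the names `gaussian_mask` and `apply_mask`, and the choice to apply the mask by element-wise multiplication are all hypothetical. In the paper the Gaussian parameters are predicted by a learned decoder; here they are fixed by hand.

```python
import math

def gaussian_mask(height, width, mu, sigma):
    """Build an H x W soft mask from an axis-aligned 2D Gaussian.

    mu = (mu_y, mu_x) and sigma = (sigma_y, sigma_x) are hypothetical
    stand-ins for parameters that a learned decoder would predict
    for each correspondence map.
    """
    mu_y, mu_x = mu
    sigma_y, sigma_x = sigma
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            exponent = ((y - mu_y) / sigma_y) ** 2 + ((x - mu_x) / sigma_x) ** 2
            row.append(math.exp(-0.5 * exponent))
        mask.append(row)
    return mask

def apply_mask(corr, mask):
    """Down-weight similarities far from the Gaussian mean (assumed: element-wise product)."""
    return [[c * m for c, m in zip(corr_row, mask_row)]
            for corr_row, mask_row in zip(corr, mask)]

# Toy correspondence map with uniform similarity, so the output equals the mask.
corr = [[1.0] * 5 for _ in range(5)]
mask = gaussian_mask(5, 5, mu=(2.0, 2.0), sigma=(1.0, 1.0))
weighted = apply_mask(corr, mask)
```

In this toy case the similarity at the Gaussian mean is kept unchanged, while entries farther away are progressively suppressed, which is the qualitative effect the abstract attributes to the uncertainty model.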

Paper Structure

This paper contains 24 sections, 10 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of a typical correspondence construction scheme and ours: (a) correlation volumes are established directly from correspondence maps through cross-attention, with fixed-range sampling; (b) our scheme adopts learnable Gaussian uncertainty to suppress outliers in correlation volumes, with input-dependent deformable sampling to improve the correlation context range.
  • Figure 2: Global overview of our proposed LGU-SLAM. (1) Deep feature extraction: video frames are fed into networks for deep semantic abstraction. (2) Correspondence construction: first, the bipartite graph is used to index the feature sequences $f_{l}$ for computing correlation volumes, and an MLP-based decoder outputs a 2D Gaussian for every correspondence map to generate Gaussian uncertainty masks that suppress outlier visual similarities; second, the proposed multi-scale deformable correlation sampling enhances contextual construction with an input-dependent sampling range. (3) Temporal iterative enhancement: the designed KAN-bias GRU performs temporal iterative enhancement, combined with dense bundle adjustment (DBA) to optimize pose and depth information.
  • Figure 3: Generation of multi-scale offset (layer s).
  • Figure 4: Uncertainty-based filtering of predicted offsets. To suppress redundant offsets, the final filtered offset is obtained by point-wise multiplication of the soft mask and the predicted offset tensor.
  • Figure 5: Comparison results on the ETH3D-test RGB-D benchmark, ATE [cm]. Our method achieves excellent generalization without fine-tuning on ETH3D like DVI-SLAM; the test results can be found at https://www.eth3d.net/slam_benchmark
  • ...and 2 more figures
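
The offset filtering described in the Figure 4 caption, together with sampling around a fixed look-up range, can be sketched as follows. This is a single-scale toy in plain Python under stated assumptions: the function names (`filter_offsets`, `bilinear`, `deformable_lookup`), the square look-up window, the border clamping, and the use of bilinear interpolation are hypothetical choices for illustration and are not taken from the paper. Only the point-wise product of the soft mask and the predicted offsets comes from the caption.

```python
import math

def filter_offsets(offsets, soft_mask):
    """Point-wise product of a soft mask and predicted (dy, dx) offsets."""
    return [(m * dy, m * dx) for (dy, dx), m in zip(offsets, soft_mask)]

def bilinear(corr, y, x):
    """Bilinearly interpolate a 2D list at (y, x), clamping to the border."""
    h, w = len(corr), len(corr[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * corr[y0][x0] + wx * corr[y0][x1]
    bottom = (1 - wx) * corr[y1][x0] + wx * corr[y1][x1]
    return (1 - wy) * top + wy * bottom

def deformable_lookup(corr, center, radius, offsets, soft_mask):
    """Sample a (2r+1)^2 window around `center`, shifted by filtered offsets."""
    cy, cx = center
    # A priori look-up grid: fixed integer displacements within the radius.
    base = [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]
    filtered = filter_offsets(offsets, soft_mask)
    return [bilinear(corr, cy + by + oy, cx + bx + ox)
            for (by, bx), (oy, ox) in zip(base, filtered)]

# Toy correlation map whose value equals the column index (a horizontal ramp).
corr = [[float(x) for x in range(5)] for _ in range(5)]
offsets = [(0.0, 0.5)] * 9  # hypothetical predicted offsets: half a pixel to the right

samples_masked_out = deformable_lookup(corr, (2, 2), 1, offsets, [0.0] * 9)
samples_kept = deformable_lookup(corr, (2, 2), 1, offsets, [1.0] * 9)
```

With a soft mask of zeros the offsets are fully suppressed and the look-up falls back to the fixed grid; with a mask of ones every sample shifts by the predicted half pixel. A multi-scale variant would presumably repeat this per pyramid level, but that detail is not specified in the text above.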