MambaGlue: Fast and Robust Local Feature Matching With Mamba

Kihwan Ryoo, Hyungtae Lim, Hyun Myung

TL;DR

MambaGlue addresses the need for fast, robust local feature matching by combining the Mamba architecture with Transformer-style attention. It introduces two novel components: a MambaAttention mixer that selectively aggregates context over $d$-dimensional feature states $\mathbf{x}_q \in \mathbb{R}^d$, and a deep confidence score regressor that outputs a per-feature score $c_q \in (0,1)$. An exit test then uses these scores to stop early and to prune unreliable features. Trained in two stages, MambaGlue achieves higher accuracy with low latency on HPatches, MegaDepth-1500, and Aachen Day-Night compared with strong baselines such as LightGlue and SuperGlue. The results demonstrate practical impact for robust camera pose estimation and visual localization under challenging illumination and viewpoint changes.

Abstract

In recent years, deep learning-based approaches to robust matching have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for matching techniques that are both robust and fast. To address this, we propose a novel Mamba-based local feature matching approach called MambaGlue. Mamba is an emerging state-of-the-art architecture that is rapidly gaining recognition for its superior speed in both training and inference and for its promising performance compared with Transformer architectures. In particular, we propose two modules: a) a MambaAttention mixer, which simultaneously and selectively understands the local and global context through a Mamba-based self-attention structure, and b) a deep confidence score regressor, a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available at https://github.com/url-kaist/MambaGlue
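
As a rough illustration of the second module, the sketch below maps a $d$-dimensional feature state to a confidence score in $(0,1)$ with a small MLP followed by a sigmoid. It is only a minimal sketch under stated assumptions: the layer sizes, the ReLU activation, the random initialization, and the names `make_mlp` and `confidence_score` are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def make_mlp(sizes, seed=0):
    """Create random weights for an MLP (layer sizes are hypothetical)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def confidence_score(layers, state):
    """Map one feature state x_q (list of d floats) to a score c_q in (0, 1)."""
    h = state
    for i, (weights, biases) in enumerate(layers):
        # Affine transform: one dot product per output unit.
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only (assumed)
    return 1.0 / (1.0 + math.exp(-h[0]))  # sigmoid squashes the logit into (0, 1)


# Usage: score five random 8-dimensional states with an untrained regressor.
d = 8
regressor = make_mlp([d, 16, 1])
rng = random.Random(1)
states = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(5)]
scores = [confidence_score(regressor, x) for x in states]
```

With untrained weights the scores carry no meaning; the point is only the shape of the computation, one scalar in $(0,1)$ per feature state.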

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Qualitative comparison of matching performance between LightGlue (Lindenberger et al., 2023) and our proposed method, MambaGlue, on outdoor visual localization, given exactly the same keypoints and initial descriptors provided by SuperPoint (DeTone et al., 2018) under the same threshold parameters. Note that MambaGlue demonstrates more robust matching performance even under challenging conditions, such as illumination changes, increasing the inlier ratio within the final correspondences.
  • Figure 2: (a) Overview of the proposed feature matching pipeline called MambaGlue. Paired sets of local feature points and their descriptors ($\mathbf{p}^I, \mathbf{d}^I$), where $I\in\{A,B\}$, pass through layers sequentially from $L_1$ to $L_N$, with an exit test at the end of each layer except the last. (b) Description of the $n$-th layer in the pipeline, which mainly consists of a succession of a MambaAttention mixer, a cross-attention, and a deep confidence score regressor. The states $\mathbf{x}^A$ and $\mathbf{x}^B$ are initialized by the local visual descriptors $\mathbf{d}^A$ and $\mathbf{d}^B$, respectively, i.e. $\mathbf{x}^A \leftarrow \mathbf{d}^A$ and $\mathbf{x}^B \leftarrow \mathbf{d}^B$. Each layer augments these states with global context as they pass through a MambaAttention mixer and a cross-attention. At the end of layer $L_n$, where $n\in\{1,\ldots,N-1\}$, a deep confidence score regressor outputs the set of confidence scores $\mathbf{c}_n$ to predict whether the current $n$-th matching prediction is sufficiently reliable. (c) Diagram of the exit test. At the end of every layer except the last, it decides whether to halt the process based on the confidence scores. If a sufficient number of features are confident for matching, MambaGlue stops the iteration and performs feature matching; otherwise, the iteration proceeds after pruning potentially unreliable features.
  • Figure 3: (a) The architecture of the MambaVision block (Hatamizadeh and Kautz, 2024), which can only take an image as input and thus cannot be directly used for feature matching tasks, and (b) our proposed MambaAttention mixer block, which takes feature points and states from descriptors as input. Our MambaAttention mixer mainly consists of three branches: (i) a self-attention block with positional encoding for point input $\mathbf{p}_q$, (ii) a direct connection of the input to preserve the original feature, and (iii) a Mamba-based block, which is inspired by (a). The features are then concatenated at the end of the block to selectively and holistically provide the refined context for the next stage.
  • Figure 4: The loss and recall graph for the pre-training process of MambaGlue. After training on 5M image pairs (only 2 GPU-days), our MambaGlue achieves (a) 26.7% lower loss at the final layer and (b) 0.3% higher match recall than LightGlue.
  • Figure 5: The AUC graph of reprojection error with varying exit thresholds $\alpha$ on the HPatches dataset (Balntas et al., 2017) when using direct linear transformation (DLT) and RANSAC with varying thresholds: (a) 1 pixel, (b) 3 pixels, (c) 5 pixels.
  • ...and 1 more figure
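
The exit test described in the Figure 2 caption can be sketched as a small control loop. This is a hedged illustration rather than the authors' implementation: the caption only states that the process halts when enough features are confident and otherwise continues after pruning unreliable ones. The per-feature confidence threshold `conf_thresh`, the low-score pruning rule with `prune_thresh`, the reading of $\alpha$ as the required fraction of confident features, and the toy `score_fn` are all assumptions introduced here.

```python
def exit_test(scores, conf_thresh=0.9, alpha=0.95):
    """Return True when the fraction of confident features reaches alpha."""
    if not scores:
        return True  # nothing left to refine
    confident = sum(1 for c in scores if c > conf_thresh)
    return confident / len(scores) >= alpha


def prune(indices, scores, prune_thresh=0.1):
    """Keep only features whose score is not below prune_thresh (assumed rule)."""
    return [i for i, c in zip(indices, scores) if c >= prune_thresh]


def match_with_early_exit(num_features, num_layers, score_fn,
                          conf_thresh=0.9, alpha=0.95, prune_thresh=0.1):
    """Run layers 1..N with an exit test after every layer except the last.

    score_fn(n, indices) stands in for the confidence regressor of layer n.
    Returns (index of the layer where processing stopped, surviving features).
    """
    active = list(range(num_features))
    for n in range(1, num_layers + 1):
        # The mixer and cross-attention updates of layer n would run here.
        if n == num_layers:
            break  # no exit test after the last layer
        scores = score_fn(n, active)
        if exit_test(scores, conf_thresh, alpha):
            return n, active
        active = prune(active, scores, prune_thresh)
    return num_layers, active


def toy_scores(n, indices):
    """Hypothetical scores: feature 5 stays unreliable, the rest grow with depth."""
    return [0.05 if i == 5 else min(0.99, 0.4 * n) for i in indices]


# Usage: six features, nine layers; feature 5 is pruned after layer 1.
stopped_at, kept = match_with_early_exit(6, 9, toy_scores)
print(stopped_at, kept)  # 3 [0, 1, 2, 3, 4]
```

In this toy run the unreliable feature is dropped after the first layer, and the loop halts at layer 3 once all remaining features exceed the confidence threshold, so the later layers are never evaluated.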