Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Xiang Fang, Shihua Zhang, Hao Zhang, Tao Lu, Huabing Zhou, Jiayi Ma
TL;DR
CorrMamba addresses two-view correspondence learning by combining differentiable causal sequence learning with a local context graph module and a channel-aware Mamba filter, so that it selectively mines information from true correspondences while suppressing outliers. The architecture comprises a Causal Sequence Learning Block, Local Graph Pattern Learning, and a Channel-Aware Mamba Filter, followed by an inlier predictor and robust essential matrix estimation. Experiments on outdoor relative pose estimation and visual localization show state-of-the-art performance, including a 2.58-percentage-point gain in AUC@20° outdoors, and ablations support the contribution of each component. The result is a scalable, permutation-invariant, and generalizable framework that prunes mismatches while preserving the geometric information that downstream vision tasks depend on.
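AUC@20° summarizes relative pose accuracy as the normalized area under the cumulative curve of per-pair pose errors up to 20°, so a 2.58-point gain means that area rose by 0.0258 on a 0 to 1 scale. The sketch below assumes the convention popularized by SuperGlue's public evaluation code (per-pair error taken as the maximum of the rotation and translation angular errors, trapezoidal integration, normalization by the threshold); the paper's own evaluation script may differ, and `pose_auc` is a hypothetical helper name.

```python
import bisect


def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve, one value per threshold.

    errors: per-pair pose errors in degrees (assumed here to be
    max(rotation error, translation error), as in common practice).
    Returns values in [0, 1]; multiply by 100 for percent.
    """
    errors = sorted(errors)
    n = len(errors)
    # Fraction of pairs whose error is <= the k-th sorted error.
    recall = [(i + 1) / n for i in range(n)]
    # Anchor the curve at the origin (error 0, recall 0).
    errors = [0.0] + errors
    recall = [0.0] + recall
    aucs = []
    for t in thresholds:
        # Index of the first error >= t (same as numpy.searchsorted, side="left").
        last = bisect.bisect_left(errors, t)
        # Truncate the curve at t, holding the last recall value flat.
        r = recall[:last] + [recall[last - 1]]
        e = errors[:last] + [t]
        # Trapezoidal integration, normalized by the threshold.
        area = sum(
            (e[k + 1] - e[k]) * (r[k] + r[k + 1]) / 2 for k in range(len(e) - 1)
        )
        aucs.append(area / t)
    return aucs


# Hypothetical per-pair error (degrees), for illustration only.
print(pose_auc([10.0]))  # [0.0, 0.0, 0.75]
```

Trapezoidal integration linearly interpolates recall between sorted errors instead of treating the curve as a step function, which is why a single 10° pair scores 0.75 rather than 0.5 at 20°.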
Abstract
Two-view correspondence learning aims to distinguish true from false correspondences between image pairs by recognizing the different information each kind carries. Previous methods either treat all of this information equally or require explicit storage of the entire context, which becomes costly in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose \textbf{CorrMamba}, a \textbf{Corr}espondence filter that leverages \textbf{Mamba}'s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. Because keypoints are unordered, feeding them directly to Mamba could obscure its ability to mine spatial information; to prevent this, we design a causal sequential learning approach based on the Gumbel-Softmax technique that establishes causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module captures contextual cues critical for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation and visual localization, together with further analysis, demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by $2.58$ absolute percentage points in AUC@20\textdegree, highlighting its practical superiority. Our code will be publicly available.
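The abstract does not spell out how the Gumbel-Softmax is used to order keypoints. One plausible construction, sketched here purely under that assumption, scores each correspondence and then fills the sequence position by position with a Gumbel-Softmax sample over the correspondences not yet placed. The soft rows form a relaxed permutation matrix through which gradients can flow, while the argmax of each row gives the hard order (in an autograd framework one would combine the two with the straight-through trick). The helpers `gumbel_softmax`, `gumbel_ordering`, and `causal_scan` are hypothetical, and `causal_scan` is a toy linear recurrence standing in for Mamba's selective scan, included only to show why order matters: each output depends only on items earlier in the chosen sequence.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    perturbed = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / tau)
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in perturbed]  # masked (-inf) entries become 0.0
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_ordering(scores, tau=0.5, seed=0):
    """Hypothetical differentiable ordering of n items from per-item scores.

    Row t of the returned soft matrix is a Gumbel-Softmax sample over the
    items not yet placed; its argmax is the item placed at position t.
    """
    rng = random.Random(seed)
    n = len(scores)
    placed = [False] * n
    soft_rows, order = [], []
    for _ in range(n):
        # Mask already-placed items so each row selects a new item.
        masked = [-math.inf if placed[i] else scores[i] for i in range(n)]
        row = gumbel_softmax(masked, tau, rng)
        j = row.index(max(row))  # hard choice for this position
        placed[j] = True
        soft_rows.append(row)
        order.append(j)
    return soft_rows, order


def causal_scan(xs, decay=0.9):
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t (not Mamba's scan)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


# Hypothetical per-correspondence scores and scalar features.
scores = [0.2, 1.5, -0.3, 0.9]
feats = [1.0, 2.0, 3.0, 4.0]
rows, order = gumbel_ordering(scores, tau=0.5, seed=0)
hard_seq = [feats[j] for j in order]  # hard reordering (forward pass)
# Relaxed reordering: each position is a convex mix of all features.
soft_seq = [sum(r[i] * feats[i] for i in range(len(feats))) for r in rows]
print(order, causal_scan(hard_seq), soft_seq)
```

With a low temperature `tau` the soft rows approach one-hot vectors and `soft_seq` approaches `hard_seq`; a higher `tau` gives smoother gradients at the cost of a blurrier ordering.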
