Multi-Expert Learning Framework with the State Space Model for Optical and SAR Image Registration
Wei Wang, Dou Quan, Ning Huyan, Chonghua Lv, Shuang Wang, Yunan Li, Licheng Jiao
TL;DR
This work tackles cross-modal optical–SAR image registration under strong radiometric differences and sparse textures by introducing ME-SSM, which combines a multi-expert learning framework (MELF), a State Space Model backbone (Mamba), and a multi-level feature aggregation (MFA) module. MELF enriches features by processing geometrically transformed inputs through multiple lightweight experts and fusing their outputs with a learnable soft router. Mamba captures global context at linear complexity through a multi-directional cross-scanning strategy, and MFA strengthens multi-scale fusion. The approach is trained end-to-end with a similarity-based registration objective and a composite loss that enforces robust matching and peak emphasis, achieving state-of-the-art results on the SEN1-2 and OS datasets with competitive speed. Extensive ablations validate the contributions of MELF, MFA, and Mamba, and qualitative results show improved feature discriminability and unimodal similarity peaks, highlighting ME-SSM’s potential for robust cross-modal remote sensing alignment and real-world fusion tasks.
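To make the multi-directional cross-scanning idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes four scan directions (row-major, column-major, and their reverses), as is common in visual state space models; the exact directions used by ME-SSM are not stated here. The function names `cross_scan` and `cross_merge` are hypothetical. Each direction turns the 2-D feature map into a 1-D token sequence that a linear-time sequence model could process, and the merge step maps the sequences back to the spatial grid and averages them.

```python
import numpy as np

def cross_scan(feat):
    """Flatten an (H, W, C) feature map into four 1-D token sequences.

    Assumed directions: row-major, column-major, and the reverse of each.
    Returns an array of shape (4, H*W, C).
    """
    h, w, c = feat.shape
    row_major = feat.reshape(h * w, c)                     # left-to-right, then top-to-bottom
    col_major = feat.transpose(1, 0, 2).reshape(h * w, c)  # top-to-bottom, then left-to-right
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])

def cross_merge(seqs, h, w):
    """Undo each scan order and average the four results into one (H, W, C) map."""
    c = seqs.shape[-1]
    row_major, col_major, row_rev, col_rev = seqs
    maps = [
        row_major.reshape(h, w, c),
        col_major.reshape(w, h, c).transpose(1, 0, 2),
        row_rev[::-1].reshape(h, w, c),
        col_rev[::-1].reshape(w, h, c).transpose(1, 0, 2),
    ]
    return np.mean(maps, axis=0)

if __name__ == "__main__":
    feat = np.arange(2 * 3, dtype=float).reshape(2, 3, 1)
    seqs = cross_scan(feat)
    print(seqs.shape)  # four sequences of H*W tokens each
    # With no sequence model applied in between, merging recovers the input.
    print(np.allclose(cross_merge(seqs, 2, 3), feat))
```

In a full model, each of the four sequences would pass through a state space layer before merging, so every pixel receives context accumulated along several scan paths while the cost stays linear in the number of tokens.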
Abstract
Optical and Synthetic Aperture Radar (SAR) image registration is crucial for multi-modal image fusion and its downstream applications. However, several challenges limit the performance of existing deep learning-based methods in cross-modal image registration: (i) significant nonlinear radiometric variations between optical and SAR images hinder shared feature learning and matching; (ii) limited textures in the images hinder discriminative feature extraction; (iii) the local receptive field of Convolutional Neural Networks (CNNs) restricts the learning of contextual information, while Transformers capture long-range global features but at high computational cost. To address these issues, this paper proposes a multi-expert learning framework with the State Space Model (ME-SSM) for optical and SAR image registration. First, to improve registration performance under limited textures, ME-SSM constructs a multi-expert learning framework to capture shared features from multi-modal images. Specifically, it extracts features from various transformations of the input image and employs a learnable soft router to fuse these features dynamically, thereby enriching the feature representation. Second, ME-SSM introduces a state space model, Mamba, for feature extraction, which employs a multi-directional cross-scanning strategy to capture global contextual relationships efficiently with linear complexity. This allows ME-SSM to expand the receptive field and improve registration accuracy without incurring high computational cost. Additionally, ME-SSM uses a multi-level feature aggregation (MFA) module to enhance multi-scale feature fusion and interaction. Extensive experiments demonstrate the effectiveness and advantages of the proposed ME-SSM for optical and SAR image registration.
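The multi-expert idea described above (features extracted from several transformations of the input, fused by a learnable soft router) can be illustrated with a small NumPy sketch. This is a hedged illustration, not the paper's architecture. The assumptions are: the transformations are the identity, two flips, and a 90-degree rotation; each expert is a toy horizontal-difference filter followed by channel mixing and ReLU; expert outputs are mapped back to the input frame before fusion; and the router is a linear layer on globally pooled input features followed by a softmax. All names (`TRANSFORMS`, `expert`, `multi_expert_fuse`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed (forward, inverse) geometric transforms on (H, W, C) arrays.
TRANSFORMS = [
    (lambda x: x, lambda y: y),                             # identity
    (np.fliplr, np.fliplr),                                 # horizontal flip
    (np.flipud, np.flipud),                                 # vertical flip
    (lambda x: np.rot90(x, 1), lambda y: np.rot90(y, -1)),  # 90-degree rotation
]

def expert(x, weight):
    """Toy lightweight expert: horizontal difference, channel mixing, ReLU.

    x: (H, W, C), weight: (C, D) -> output (H, W, D).
    """
    local = x - np.roll(x, 1, axis=1)  # not equivariant to flips/rotations, so views differ
    return np.maximum(local @ weight, 0.0)

def multi_expert_fuse(image, expert_weights, router_weight):
    """Run one expert per transformed view and fuse the outputs with soft gates."""
    feats = []
    for (forward, inverse), weight in zip(TRANSFORMS, expert_weights):
        # Assumption: features are aligned back to the input frame before fusion.
        feats.append(inverse(expert(forward(image), weight)))
    feats = np.stack(feats)                            # (E, H, W, D)
    logits = image.mean(axis=(0, 1)) @ router_weight   # pooled input -> (E,)
    gates = softmax(logits)                            # soft routing weights, sum to 1
    fused = np.tensordot(gates, feats, axes=1)         # weighted sum -> (H, W, D)
    return fused, gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 6, 3))
    expert_weights = [rng.normal(size=(3, 4)) for _ in TRANSFORMS]
    router_weight = rng.normal(size=(3, len(TRANSFORMS)))
    fused, gates = multi_expert_fuse(image, expert_weights, router_weight)
    print(fused.shape, gates.round(3))
```

Because the gates come from a softmax rather than a hard top-k selection, every expert contributes to the fused feature and the router stays differentiable, which is what makes the routing "soft" and learnable end-to-end.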
