Multi-Expert Learning Framework with the State Space Model for Optical and SAR Image Registration

Wei Wang, Dou Quan, Ning Huyan, Chonghua Lv, Shuang Wang, Yunan Li, Licheng Jiao

TL;DR

This work tackles cross-modal optical–SAR image registration under strong radiometric differences and sparse textures by introducing ME-SSM, a framework that combines a multi-expert learning framework (MELF), a State Space Model backbone (Mamba), and a multi-level feature aggregation (MFA) module. MELF enriches features by processing geometrically transformed inputs through multiple lightweight experts and fusing their outputs with a learnable soft router. Mamba captures global context with a linear-complexity, multi-directional cross-scanning strategy, and MFA strengthens multi-scale fusion. The model is trained end-to-end with a similarity-based registration objective and a composite loss that encourages robust matching and a sharp similarity peak, achieving state-of-the-art results on the SEN1-2 and OS datasets at competitive speed. Ablations validate the contributions of MELF, MFA, and Mamba, and further results show improved feature discriminability and unimodal similarity peaks, indicating ME-SSM's potential for robust cross-modal remote sensing alignment and real-world fusion tasks.

Abstract

Optical and Synthetic Aperture Radar (SAR) image registration is crucial for multi-modal image fusion and its downstream applications. However, several challenges limit the performance of existing deep learning-based methods in cross-modal image registration: (i) significant nonlinear radiometric variations between optical and SAR images hinder shared feature learning and matching; (ii) limited textures in images hinder discriminative feature extraction; (iii) the local receptive field of Convolutional Neural Networks (CNNs) restricts the learning of contextual information, while the Transformer can capture long-range global features but at high computational complexity. To address these issues, this paper proposes a multi-expert learning framework with the State Space Model (ME-SSM) for optical and SAR image registration. First, to improve registration performance with limited textures, ME-SSM constructs a multi-expert learning framework to capture shared features from multi-modal images. Specifically, it extracts features from various transformations of the input image and employs a learnable soft router to dynamically fuse these features, thereby enriching feature representations. Second, ME-SSM introduces a state space model, Mamba, for feature extraction, which employs a multi-directional cross-scanning strategy to efficiently capture global contextual relationships with linear complexity. This lets ME-SSM expand the receptive field and improve registration accuracy without incurring high computational cost. Additionally, ME-SSM uses a multi-level feature aggregation (MFA) module to enhance multi-scale feature fusion and interaction. Extensive experiments demonstrate the effectiveness and advantages of the proposed ME-SSM on optical and SAR image registration.
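
The abstract's idea of extracting features from several transformations of the input and fusing them with a learnable soft router can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption for illustration, not the paper's architecture: the transformations are rotations by multiples of 90 degrees, each "expert" is a horizontal difference followed by a pointwise projection and ReLU, and the router is a single linear layer over globally pooled expert features. The names `soft_router_fuse`, `expert_weights`, and `router_weights` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def soft_router_fuse(image, expert_weights, router_weights):
    """Fuse features from K experts with a soft (softmax) router.

    image:          (C, H, W) input feature map or image
    expert_weights: (K, D, C) one pointwise projection per expert (hypothetical)
    router_weights: (K, K*D)  linear router over pooled features (hypothetical)
    returns:        fused features (D, H, W) and gate weights (K,)
    """
    feats = []
    for k, w in enumerate(expert_weights):
        # Assumed transformation: rotate the input by k * 90 degrees.
        x = np.rot90(image, k=k, axes=(1, 2))
        # Assumed expert: horizontal difference, pointwise projection, ReLU.
        diff = x - np.roll(x, 1, axis=2)
        f = np.maximum(np.einsum("dc,chw->dhw", w, diff), 0.0)
        # Undo the rotation so all expert outputs are spatially aligned.
        feats.append(np.rot90(f, k=-k, axes=(1, 2)))
    feats = np.stack(feats)                      # (K, D, H, W)
    pooled = feats.mean(axis=(2, 3)).reshape(-1)  # global average pool, (K*D,)
    gates = softmax(router_weights @ pooled)      # soft routing weights, (K,)
    fused = np.tensordot(gates, feats, axes=1)    # weighted sum, (D, H, W)
    return fused, gates
```

Because the gates come from a softmax rather than a hard top-1 choice, every expert contributes to the fused feature and the router stays differentiable, which is what "soft" routing usually refers to.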

Paper Structure

This paper contains 33 sections, 17 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Deep learning frameworks for optical and SAR image registration perform well on images with rich textures, but their performance drops sharply on images with limited textures.
  • Figure 2: The pipeline of the proposed multi-expert learning framework with the State Space Model (ME-SSM) for optical and SAR image registration. ME-SSM utilizes a multi-expert learning framework to dynamically aggregate rich features from different transformed images by various experts and a learnable soft router. ME-SSM employs the State Space Model, Mamba, for feature extraction, which can effectively capture global spatial information and multi-level features. Ultimately, we utilize the normalized SAR feature as a convolution kernel and perform a sliding convolution on the optical feature for fast similarity calculation, thereby achieving the optical and SAR image registration.
  • Figure 3: The pipeline of the multi-expert learning framework.
  • Figure 4: The multi-level feature aggregation (MFA) module comprises two parts: multi-scale adaptive aggregation (MSAA) and channel aggregation (CA), which can further enhance feature representation.
  • Figure 5: The similarity map $S$. The positive sample region is a soft-label region obtained by applying Gaussian smoothing centered on the matching point. The negative samples used for deep model optimization are selected from the top-$k$ similarity candidates that fall outside the positive sample region.
  • ...and 7 more figures
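
The captions of Figures 2 and 5 describe three steps: sliding the normalized SAR feature over the optical feature to obtain a similarity map $S$, building a Gaussian soft label around the matching point, and picking top-$k$ negatives outside the positive region. The following is a minimal NumPy sketch of those steps under stated assumptions. The function names, the `sigma` value, and the `pos_threshold` used to delimit the positive region are hypothetical and are not taken from the paper.

```python
import numpy as np

def similarity_map(opt_feat, sar_feat, eps=1e-8):
    """Slide the L2-normalized SAR feature over the optical feature.

    opt_feat: (C, H, W), sar_feat: (C, h, w) with h <= H and w <= W
    returns:  S of shape (H - h + 1, W - w + 1)
    """
    kernel = sar_feat / (np.linalg.norm(sar_feat) + eps)
    # windows has shape (1, H-h+1, W-w+1, C, h, w)
    windows = np.lib.stride_tricks.sliding_window_view(opt_feat, sar_feat.shape)
    return np.einsum("ijchw,chw->ij", windows[0], kernel)

def gaussian_soft_label(shape, center, sigma=1.0):
    """Soft label equal to 1 at the matching point, decaying as a Gaussian."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def topk_negatives(sim, label, k=5, pos_threshold=0.1):
    """Return (row, col) of the k most similar positions outside the
    positive region, here assumed to be where label > pos_threshold."""
    masked = np.where(label > pos_threshold, -np.inf, sim)
    flat = np.argsort(masked, axis=None)[::-1][:k]
    return np.stack(np.unravel_index(flat, sim.shape), axis=1)

# Usage: crop a "SAR" patch from the optical feature so the true offset is known.
rng = np.random.default_rng(0)
opt = rng.normal(size=(4, 12, 12))
sar = opt[:, 3:8, 5:10]
S = similarity_map(opt, sar)
label = gaussian_soft_label(S.shape, center=(3, 5), sigma=1.0)
negatives = topk_negatives(S, label, k=5)
```

Restricting negatives to the highest-scoring positions outside the soft-label region is a form of hard negative mining: the loss then focuses on the false peaks most likely to be confused with the true match.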