MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement

Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang, Rui Liu, Xinghao Ding, Haoxuan Che

TL;DR

The paper tackles pan-sharpening by fusing high-resolution PAN with low-resolution MS to obtain HRMS with preserved spectral fidelity and sharp spatial details. It introduces MMMamba, a cross-modal in-context fusion framework built on the Mamba architecture, achieving linear computational complexity and bidirectional cross-modal interaction via a novel Multimodal Interleaved (MI) scanning mechanism. MMMamba supports zero-shot MS image super-resolution by simply omitting the PAN input, demonstrating flexible cross-modal generalization. Across WV2, GF2, and WV3 benchmarks, MMMamba outperforms state-of-the-art methods on reduced- and full-resolution tasks, while maintaining computational efficiency and robust cross-modal fusion performance.

Abstract

Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
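
The abstract does not spell out how the MI scanning mechanism orders tokens, so the following is only an illustrative sketch of one plausible reading: PAN and MS token sequences of equal length are alternated into a single sequence that a sequence model (such as a Mamba block) could then scan, and the PAN stream is simply left out for the zero-shot super-resolution setting. The function names `interleave_tokens` and `split_tokens`, the token shapes, and the PAN-first ordering are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def interleave_tokens(ms_tokens, pan_tokens=None):
    """Alternate PAN and MS tokens along the sequence axis (hypothetical helper).

    ms_tokens, pan_tokens: arrays of shape (num_tokens, dim).
    Returns p0, m0, p1, m1, ... with shape (2 * num_tokens, dim).
    If pan_tokens is None (assumed zero-shot setting), the MS tokens
    are returned unchanged.
    """
    if pan_tokens is None:
        return ms_tokens
    if ms_tokens.shape != pan_tokens.shape:
        raise ValueError("PAN and MS token arrays must have the same shape")
    n, d = ms_tokens.shape
    out = np.empty((2 * n, d), dtype=np.result_type(ms_tokens, pan_tokens))
    out[0::2] = pan_tokens  # even positions hold PAN tokens (assumed order)
    out[1::2] = ms_tokens   # odd positions hold MS tokens
    return out


def split_tokens(sequence):
    """Undo interleave_tokens: return (pan_tokens, ms_tokens)."""
    return sequence[0::2], sequence[1::2]


# Toy example: 4 tokens per modality, 3 features each.
ms = np.arange(12, dtype=np.float32).reshape(4, 3)
pan = -np.ones((4, 3), dtype=np.float32)

fused_input = interleave_tokens(ms, pan)   # shape (8, 3)
pan_back, ms_back = split_tokens(fused_input)
zero_shot_input = interleave_tokens(ms)    # PAN omitted
```

Because the interleaved sequence is only twice as long as either input, a model whose cost is linear in sequence length keeps that linear cost under this arrangement, which is consistent with the complexity claim in the abstract.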

Paper Structure

This paper contains 24 sections, 28 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The overall framework of our proposed MMMamba, the first exploration of the in-context conditioning paradigm in pan-sharpening. This framework enables bidirectional information flow between the PAN and MS modalities and supports zero-shot generalization to tasks such as image super-resolution. The proposed MI scanning strategy captures complementary information and facilitates effective cross-modal interaction.
  • Figure 2: Visual comparison of all methods on WV3. The last row visualizes the MSE residuals between the pan-sharpening results and the ground truth.
  • Figure 3: Visual comparison of the zero-shot image super-resolution results on the WV2 dataset.