CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening

Wen-Jie Shu, Hong-Xia Dou, Rui Wen, Xiao Wu, Liang-Jian Deng

TL;DR

The paper addresses pansharpening by introducing a Cross Modulation Transformer (CMT) that fuses panchromatic (PAN) and low-resolution multispectral (LRMS) information through a novel cross modulation step in the attention mechanism. It combines a three-phase architecture (feature extraction, CMAB-based modulation, and aggregation) with a Fourier–Wavelet hybrid loss to capture both global structure and local textures. Key contributions are the Cross Modulation Attention Block (CMAB) and a loss that blends Fourier and wavelet components to improve spatial detail and spectral fidelity. Experiments on the WV3 and GF2 datasets show state-of-the-art performance. The code is to be released publicly to support further research, and the approach may apply beyond pansharpening.

Abstract

Pansharpening aims to enhance remote sensing image (RSI) quality by merging high-resolution panchromatic (PAN) with multispectral (MS) images. However, prior techniques struggled to optimally fuse PAN and MS images for enhanced spatial and spectral information, due to a lack of a systematic framework capable of effectively coordinating their individual strengths. In response, we present the Cross Modulation Transformer (CMT), a pioneering method that modifies the attention mechanism. This approach utilizes a robust modulation technique from signal processing, integrating it into the attention mechanism's calculations. It dynamically tunes the weights of the carrier's value (V) matrix according to the modulator's features, thus resolving historical challenges and achieving a seamless integration of spatial and spectral attributes. Furthermore, considering that RSI exhibits large-scale features and edge details along with local textures, we crafted a hybrid loss function that combines Fourier and wavelet transforms to effectively capture these characteristics, thereby enhancing both spatial and spectral accuracy in pansharpening. Extensive experiments demonstrate our framework's superior performance over existing state-of-the-art methods. The code will be publicly available to encourage further research.
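The modulation idea in the abstract can be illustrated with a minimal single-head NumPy sketch. It is one plausible reading, not the authors' CM-MSA: it assumes the attention weights are computed from the modulator's features only (queries and keys) and are then applied to the carrier's value (V) matrix. The function name `cross_modulation_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the token shapes, and the choice of PAN as modulator and MS as carrier are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modulation_attention(modulator, carrier, w_q, w_k, w_v):
    """Assumed form: weights come from the modulator, values from the carrier.

    modulator, carrier: arrays of shape (n_tokens, dim).
    """
    q = modulator @ w_q          # queries from the modulator features
    k = modulator @ w_k          # keys from the modulator features
    v = carrier @ w_v            # values (V) from the carrier features
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # carrier V re-weighted by the modulator

# Toy inputs standing in for flattened PAN and MS feature tokens.
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
pan_tokens = rng.standard_normal((n_tokens, dim))   # modulator (assumption)
ms_tokens = rng.standard_normal((n_tokens, dim))    # carrier (assumption)
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

out, weights = cross_modulation_attention(pan_tokens, ms_tokens, w_q, w_k, w_v)
```

In this sketch the carrier contributes only through V, so swapping the two inputs changes which modality supplies the content and which supplies the weighting. A "cross" design would presumably run both directions, but that detail is not stated in the text above.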

Paper Structure

This paper contains 10 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a) The modulation process in the field of communication. (b) The modulation process in coded aperture snapshot spectral imaging (CASSI). (c) Our Cross Modulation Multi-head Self-Attention (CM-MSA) modulation process.
  • Figure 2: Overall structure of the proposed method.
  • Figure 3: (a) The CMAB module consists of a Double Feed-Forward Network (DFFN), a CM-MSA module, and two layers of normalization. (b) Components of the DFFN. (c) The hybrid loss between Predicted Images ($PI$) and ground truth ($GT$), which employs both the 2D Discrete Fourier Transform (DFT) and the 2D Discrete Wavelet Transform (DWT).
  • Figure 4: Qualitative result comparison between representative methods on the GF2 reduced-resolution dataset. The first row presents RGB outputs, while the second row gives the corresponding QNR maps.
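The hybrid loss described in the caption of Figure 3(c) compares the predicted image and the ground truth in both the 2D DFT and 2D DWT domains. The NumPy sketch below shows what such a loss could look like under stated assumptions: a one-level Haar wavelet, an L1 distance in each domain, and equal weights `w_fourier` and `w_wavelet`. None of these choices is confirmed by the text above, and the function names `haar_dwt2` and `hybrid_loss` are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform over the last two axes.

    Assumes even height and width. Returns the approximation band and
    three detail bands, each at half the input resolution.
    """
    a = x[..., 0::2, 0::2]   # top-left pixel of every 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    approx = (a + b + c + d) / 2.0      # local average (coarse structure)
    detail_rows = (a + b - c - d) / 2.0  # difference between the two rows
    detail_cols = (a - b + c - d) / 2.0  # difference between the two columns
    detail_diag = (a - b - c + d) / 2.0  # diagonal difference
    return approx, detail_rows, detail_cols, detail_diag

def hybrid_loss(pred, gt, w_fourier=1.0, w_wavelet=1.0):
    """Assumed form: weighted sum of L1 distances in the DFT and DWT domains."""
    # Fourier term: mean magnitude of the complex spectrum difference.
    spectrum_diff = np.fft.fft2(pred, norm="ortho") - np.fft.fft2(gt, norm="ortho")
    fourier_term = np.mean(np.abs(spectrum_diff))
    # Wavelet term: mean absolute difference averaged over the four sub-bands.
    wavelet_term = np.mean(
        [np.mean(np.abs(p - g)) for p, g in zip(haar_dwt2(pred), haar_dwt2(gt))]
    )
    return w_fourier * fourier_term + w_wavelet * wavelet_term

# Toy example: a 4-band image and a noisy prediction of it.
rng = np.random.default_rng(0)
gt = rng.random((4, 32, 32))
pred = gt + 0.05 * rng.standard_normal(gt.shape)
loss_value = hybrid_loss(pred, gt)
```

The Fourier term responds to errors spread across the whole image, while the Haar sub-bands localize errors around edges and textures, which matches the motivation given for combining the two transforms. A practical training setup would use a differentiable framework rather than NumPy.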