Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening

Ivan Pereira-Sánchez, Eloi Sans, Julia Navarro, Joan Duran

TL;DR

This work addresses the challenge of fusing a high-resolution PAN image with a lower-resolution MS/HS image by proposing a model-based deep unfolded framework. It introduces a variational energy with a high-frequency PAN injection term penalized by the $L^1$ norm and unfolds the resulting primal-dual optimization into MARNet-based stages, where multi-head attention captures nonlocal self-similarity across patches. The paper demonstrates strong generalization across PRISMA, QuickBird, and WorldView2 datasets, outperforming many baselines and maintaining robustness to different sampling factors and noise levels, with code released publicly. The approach offers a principled, interpretable, and adaptable solution for high-quality pansharpening and hypersharpening in diverse sensor configurations.
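To make the idea of unfolding a primal-dual scheme concrete, here is a minimal 1D sketch. It is not the paper's formulation: the toy energy $\tfrac{1}{2}\|Au-f\|^2 + \lambda\|Ku-Kp\|_1 + R(u)$, the block-average operator `down`, the 3-tap `high_pass` filter, the step sizes, and the Condat-Vu style update order are all assumptions chosen for illustration. The one structural point it aims to show is that the proximity operator of the prior $R$ is passed in as a callable `prox`, which is the slot an unfolded network fills with a learned module.

```python
import numpy as np

def down(u, s):
    """Toy observation operator A: average over non-overlapping blocks of size s."""
    return u.reshape(-1, s).mean(axis=1)

def down_adjoint(v, s):
    """Adjoint A^T of the block average: replicate each sample and divide by s."""
    return np.repeat(v, s) / s

def high_pass(x):
    """Toy high-frequency operator K = I - (3-tap mean). K is symmetric, so K^T = K."""
    return x - np.convolve(x, np.ones(3) / 3.0, mode="same")

def unfolded_primal_dual(f, pan, s, prox, stages=5, tau=0.5, sigma=0.5, lam=0.1):
    """Run a fixed number of primal-dual stages (hypothetical toy scheme).

    f    : low-resolution signal, length n
    pan  : high-resolution guide, length n * s
    prox : callable standing in for the proximity operator of the prior
    """
    f = np.asarray(f, dtype=float)
    u = np.repeat(f, s)            # initialization: nearest-neighbour upsampling
    y = np.zeros_like(u)           # dual variable attached to the L1 term
    h = high_pass(np.asarray(pan, dtype=float))
    for _ in range(stages):
        # Primal step: gradient of the data term plus K^T y, then the prox slot.
        grad = down_adjoint(down(u, s) - f, s)
        u_new = prox(u - tau * (grad + high_pass(y)))
        # Dual step: the prox of the conjugate of lam * ||. - h||_1 is a shift by
        # sigma * h followed by a projection onto the box [-lam, lam].
        y = np.clip(y + sigma * (high_pass(2.0 * u_new - u) - h), -lam, lam)
        u = u_new
    return u, y
```

With `prox=lambda v: v` the loop is a plain iterative solver for the toy energy without a prior. Replacing that callable by a trained network, and letting `tau`, `sigma`, and `lam` be learned per stage, is the general pattern that unfolding methods follow.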

Abstract

The objective of pansharpening and hypersharpening is to accurately combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (MS) or hyperspectral (HS) image, respectively. Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches. These techniques involve unrolling the steps of the optimization scheme derived from the minimization of an energy into a deep learning framework, resulting in efficient and highly interpretable architectures. In this paper, we propose a model-based deep unfolded method for satellite image fusion. Our approach is based on a variational formulation that incorporates the classic observation model for MS/HS data, a high-frequency injection constraint based on the PAN image, and an arbitrary convex prior. For the unfolding stage, we introduce upsampling and downsampling layers that use geometric information encoded in the PAN image through residual networks. The backbone of our method is a multi-head attention residual network (MARNet), which replaces the proximity operator in the optimization scheme and combines multiple attention heads with residual learning to exploit image self-similarities via nonlocal operators defined in terms of patches. Additionally, we incorporate a post-processing module based on the MARNet architecture to further enhance the quality of the fused images. Experimental results on PRISMA, QuickBird, and WorldView2 datasets demonstrate the superior performance of our method and its ability to generalize across different sensor configurations and varying spatial and spectral resolutions. The source code will be available at https://github.com/TAMI-UIB/MARNet.
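The phrase "nonlocal operators defined in terms of patches" can be illustrated with a small, non-learned sketch. The code below is not MARNet: the function names, the reflect padding, the softmax over negative squared patch distances, the plain averaging of heads, and the `step` parameter are all hypothetical choices. It only shows the general mechanism, in which each pixel is replaced by a weighted average of all pixels, with weights derived from how similar their surrounding patches are in a guide image, and the result is added back through a residual connection.

```python
import numpy as np

def extract_patches(img, p):
    """Return an (H*W, p*p) array holding the p x p patch centred at each pixel.

    p is assumed odd; borders are handled with reflect padding.
    """
    r = p // 2
    padded = np.pad(img, r, mode="reflect")
    H, W = img.shape
    patches = np.empty((H * W, p * p))
    for i in range(H):
        for j in range(W):
            patches[i * W + j] = padded[i:i + p, j:j + p].ravel()
    return patches

def softmax_rows(z):
    """Numerically stable softmax along each row."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def head_attention(img, guide, p=3, temperature=1.0):
    """One toy attention head: weights from patch similarity in `guide`,
    applied to the pixel values of `img`."""
    P = extract_patches(guide, p)
    sq = (P ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (P @ P.T)   # pairwise squared distances
    weights = softmax_rows(-d2 / temperature)           # each row sums to 1
    return (weights @ img.ravel()).reshape(img.shape)

def multi_head_residual(img, guides, p=3, step=0.5):
    """Average several heads (one per guide) and add the change as a residual."""
    heads = [head_attention(img, g, p) for g in guides]
    return img + step * (np.mean(heads, axis=0) - img)
```

In a trained network the fixed distance-based weights and the plain average would typically be replaced by learned projections and learned mixing layers; the sketch keeps them fixed so that the nonlocal averaging itself is easy to inspect.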

Paper Structure (23 sections, 19 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: (a) Overall architecture of the proposed method. (b) A single primal-dual stage. (c) Post-processing module. (d) Initialization module.
  • Figure 2: Diagrams (a) and (b) display the Down and Up operators for a sampling factor of 4, while (c) and (d) show those for a sampling factor of 12. Diagram (e) depicts the architecture of the proposed geometry injection layer.
  • Figure 3: (a) Proposed MARNet. (b) Multi-Head Attention (MHA) module, which is composed of three Head Attention (HA) layers. (c) A single HA layer.
  • Figure 4: Each plot compares the rank positions based on PSNR between the validation (horizontal axis) and testing (vertical axis) sets. Accordingly, a method positioned lower and further to the left performs better. In (d), the mean rank across the three datasets is displayed. The proposed fusion method performs best in all cases.
  • Figure 5: Quantitative metrics obtained for each fusion method on the PRISMA validation and testing sets. The best values are highlighted in bold, the second best in blue and the third best in red. Our approach achieves the best results for all metrics except SSIM, where SRPPNN performs better.
  • ...and 8 more figures