Unsupervised Hyperspectral and Multispectral Image Fusion via Self-Supervised Modality Decoupling

Songcheng Du, Yang Zou, Zixu Wang, Xingyuan Li, Ying Li, Changjing Shang, Qiang Shen

TL;DR

This paper tackles unsupervised hyperspectral and multispectral image fusion (HMIF) by introducing MossFuse, a modality-decoupled framework that explicitly separates shared LR-MSI representations from modality-complementary spatial and spectral details. It advances the field with a subspace clustering loss to guide decoupling, a self-supervised constraint to enforce representation fidelity, a physics-based degradation estimation module, and an efficient modality-aggregation scheme that yields high-fidelity HR-HSIs with fewer parameters and faster inference. Experimental results across multiple synthetic and real datasets show MossFuse outperforms existing linear and deep-learning HMIF methods on key metrics (PSNR, SSIM, SAM, ERGAS), while maintaining robustness under varied degradations. The approach offers practical impact for real-world HMIF tasks by delivering accurate, spectrally faithful HR-HSIs with reduced computational costs, and sets the stage for future integration with large-scale pretraining and semantic supervision.
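The metrics named above have standard definitions that can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's evaluation code: the function names, the `(H, W, C)` array layout, the global (rather than per-band) PSNR, and the `data_range` and `scale` parameters are all assumptions made for this example. SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio in dB, computed over the whole cube (assumed convention)."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(ref, est, eps=1e-12):
    """Spectral angle mapper: mean angle in degrees between per-pixel spectra."""
    dot = np.sum(ref * est, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # clip guards against rounding errors
    return np.degrees(np.mean(np.arccos(cos)))

def ergas(ref, est, scale):
    """ERGAS: 100/scale * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    rmse = np.sqrt(np.mean((ref - est) ** 2, axis=(0, 1)))  # per-band RMSE
    mean = np.mean(ref, axis=(0, 1))                        # per-band mean of the reference
    return 100.0 / scale * np.sqrt(np.mean((rmse / mean) ** 2))

# Toy cubes (hypothetical data): a constant reference and an estimate offset by 0.1.
ref = np.full((8, 8, 31), 0.5)
est = ref + 0.1
psnr_val = psnr(ref, est)
sam_val = sam(ref, est)
ergas_val = ergas(ref, est, scale=4)
```

Higher PSNR is better, while lower SAM and ERGAS are better. In the toy case the estimate differs from the reference only by a uniform offset, so the spectra stay parallel and SAM is near zero even though PSNR and ERGAS register the error, which is why spectral and intensity metrics are usually reported together.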

Abstract

Hyperspectral and Multispectral Image Fusion (HMIF) aims to fuse low-resolution hyperspectral images (LR-HSIs) and high-resolution multispectral images (HR-MSIs) to reconstruct high spatial and high spectral resolution images. Current methods typically apply direct fusion from the two modalities without effective supervision, leading to an incomplete perception of deep modality-complementary information and a limited understanding of inter-modality correlations. To address these issues, we propose a simple yet effective solution for unsupervised HMIF, revealing that modality decoupling is key to improving fusion performance. Specifically, we propose an end-to-end self-supervised Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that decouples shared and complementary information across modalities and aggregates a concise representation of both LR-HSIs and HR-MSIs to reduce modality redundancy. Also, we introduce the subspace clustering loss as a clear guide to decouple modality-shared features from modality-complementary ones. Systematic experiments over multiple datasets demonstrate that our simple and effective approach consistently outperforms the existing HMIF methods while requiring considerably fewer parameters with reduced inference time. The anonymous source code is available at https://github.com/dusongcheng/MossFuse.
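To make the notion of modality-shared information concrete, the sketch below uses the standard linear observation model from the HMIF literature: the LR-HSI is a spatially blurred and downsampled HR-HSI, and the HR-MSI is a spectral projection of the same HR-HSI. Under that model, spatially degrading the HR-MSI and spectrally degrading the LR-HSI give the same low-resolution multispectral image. This is an illustration only and not MossFuse itself: the box-average blur standing in for the PSF, the random SRF, the array sizes, and the function names are all assumptions.

```python
import numpy as np

def spatial_degrade(img, scale):
    """Average each band over scale x scale blocks (a box-blur PSF plus downsampling)."""
    h, w, c = img.shape
    return img.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

def spectral_degrade(img, srf):
    """Project C hyperspectral bands onto c multispectral bands; srf has shape (C, c)."""
    return img @ srf

rng = np.random.default_rng(0)
hr_hsi = rng.random((32, 32, 31))        # hypothetical HR-HSI: 32 x 32 pixels, 31 bands
srf = rng.random((31, 3))                # hypothetical spectral response function
srf /= srf.sum(axis=0, keepdims=True)    # each output band is a weighted average of inputs

lr_hsi = spatial_degrade(hr_hsi, 4)      # observed LR-HSI, shape (8, 8, 31)
hr_msi = spectral_degrade(hr_hsi, srf)   # observed HR-MSI, shape (32, 32, 3)

# Both routes lead to the same LR-MSI because the two degradations act on different axes.
lr_msi_from_hsi = spectral_degrade(lr_hsi, srf)
lr_msi_from_msi = spatial_degrade(hr_msi, 4)
shared_match = np.allclose(lr_msi_from_hsi, lr_msi_from_msi)
```

The agreement holds because both operators are linear and act on separate axes (spatial versus spectral), so they commute. In the unsupervised setting the PSF and SRF are unknown and have to be estimated from the two observed images.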

Paper Structure

This paper contains 28 sections, 17 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Architecture of MossFuse. (a) Main pipeline, consisting of four key components: I. Multi-Modality Decoupling, II. Self-supervised Constraint, III. Degradation Estimation, and IV. Modality Aggregation. The losses associated with these components are, respectively, $\mathcal{L}_{\text{SC}}$ (subspace clustering loss), $\mathcal{L}_{\text{SCT}}$ (self-supervised constraint loss), $\mathcal{L}_{\text{DE}}$ (degradation estimation loss), and $\mathcal{L}_{\text{MA}}$ (modality aggregation loss). (b) and (c) Detailed architectures of the Spatial-Aware and Spectral-Aware Aggregation Blocks, respectively.
  • Figure 2: Visual reconstruction results and error images for the 11th band of HR-HSIs from the CAVE dataset.
  • Figure 3: Visual reconstruction results and error images for the 11th band of HR-HSIs from the NTIRE2018 dataset.
  • Figure 4: Visual reconstruction results from the NCALM (top) and WV-2 (bottom) datasets. Bands 25-16-7 and 5-3-2 are shown as R-G-B, respectively.
  • Figure 5: Comparison of degradation parameters PSF and SRF estimated by different methods.
  • ...and 13 more figures