CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Yinghui Xing, Litao Qu, Shizhou Zhang, Kai Zhang, Yanning Zhang

TL;DR

CrossDiff tackles pansharpening by learning self-supervised spatial–spectral representations through a cross-predictive diffusion pretext task, then adapting a fusion head on top of frozen encoders. The two-stage approach (P2M/M2P pre-training, in which PAN and MS images are predicted from each other, followed by fusion-head training) performs well at both full and reduced resolution and generalizes across sensors. On the QB, WV-2, and WV-4 datasets, CrossDiff outperforms both supervised and unsupervised baselines, and ablations support the value of the cross-predictive pretext task and of attention-guided fusion. The work advances unsupervised representation learning for remote-sensing pansharpening and suggests a pathway for cross-sensor transfer without re-training on new sensors.

Abstract

The fusion of a panchromatic (PAN) image with the corresponding multispectral (MS) image, known as pansharpening, aims to combine the abundant spatial detail of PAN with the spectral information of MS. Because high-resolution MS images are unavailable, existing deep-learning-based methods usually train at reduced resolution and test at both reduced and full resolution. When given the original MS and PAN images as inputs, they obtain sub-optimal results because of the scale variation. In this paper, we propose to explore the self-supervised representation of pansharpening by designing a cross-predictive diffusion model, named CrossDiff. Training proceeds in two stages. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on a conditional DDPM; in the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS, and only the fusion head is trained to adapt to the pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. In addition, cross-sensor experiments verify that the proposed self-supervised representation learners generalize to datasets from other satellites. We will release our code for reproducibility.
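To make the first stage more concrete, the following is a minimal sketch of how training examples for a cross-predictive pretext task could be built on top of the standard DDPM forward process. It is not the authors' implementation. The linear noise schedule, the noise-regression target, the flat-list image representation, and all function names (`linear_beta_schedule`, `q_sample`, `cross_predictive_batch`) are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Closed-form forward diffusion:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def cross_predictive_batch(pan, ms_up, t, alpha_bars, rng):
    """Build both pretext examples for one (PAN, upsampled MS) pair.

    Each example is (noisy target, condition, noise to regress).
    Images are flat lists of pixel values purely for illustration.
    """
    noisy_ms, eps_ms = q_sample(ms_up, t, alpha_bars, rng)
    noisy_pan, eps_pan = q_sample(pan, t, alpha_bars, rng)
    return {
        "P2M": (noisy_ms, pan, eps_ms),      # PAN conditions the MS denoiser
        "M2P": (noisy_pan, ms_up, eps_pan),  # MS conditions the PAN denoiser
    }
```

In a full model, each tuple would be passed to a conditional UNet that receives the noisy target, the condition image, and the timestep. Here the sketch stops at data preparation, since the network architecture is not described in this section.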

Paper Structure (17 sections, 7 equations, 5 figures, 3 tables)

This paper contains 17 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Training of our method. Stage 1: Pre-train the UNet structure with a cross-predictive pretext task based on conditional DDPM. Stage 2: Train the fusion head to adapt to the pansharpening task.
  • Figure 2: Framework of the proposed cross-predictive pretext task, where PAN and upsampled MS images are cross-predicted from each other through the reverse process.
  • Figure 3: The pansharpening adaptation stage, where spatial and spectral features are extracted by the pre-trained encoders and concatenated as input to the fusion head, which produces the fusion result FMS.
  • Figure 4: Fusion results on the QuickBird (QB) dataset at full resolution. Zoom in for more detail.
  • Figure 5: Fusion results on the WorldView-2 (WV-2) dataset at full resolution. Zoom in for more detail.
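
The adaptation stage described in the captions of Figures 1 and 3 (frozen encoders, concatenated features, trainable fusion head) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the encoders and the head are plain linear maps rather than UNet encoders and the paper's fusion head, the squared-error loss and the `target` argument are placeholders (this section does not state the loss used to train the fusion head), and the names `make_linear`, `fuse`, and `train_head_step` are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a pre-trained encoder or a fusion head."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def fuse(pan, ms, pan_encoder, ms_encoder, head):
    """Extract features with both encoders, concatenate them, and apply the head."""
    features = apply_linear(pan_encoder, pan) + apply_linear(ms_encoder, ms)
    return apply_linear(head, features), features


def train_head_step(pan, ms, target, pan_encoder, ms_encoder, head, lr=0.01):
    """One gradient step on a placeholder squared-error loss.

    Only `head` is updated in place; the encoders are read but never
    written, which is what "frozen" means in this sketch.
    Returns the loss measured before the update.
    """
    output, features = fuse(pan, ms, pan_encoder, ms_encoder, head)
    errors = [o - t for o, t in zip(output, target)]
    for row, err in zip(head, errors):
        for j, feature in enumerate(features):
            row[j] -= lr * 2.0 * err * feature  # d/dw (o - t)^2 = 2 (o - t) f
    return sum(e * e for e in errors)
```

Calling `train_head_step` repeatedly on the same inputs reduces the placeholder loss while leaving both encoder matrices untouched, which mirrors the division of labor the captions describe.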