Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening
Yongchuan Cui, Peng Liu, Yi Zeng
TL;DR
This work tackles cross-domain generalization in remote-sensing pansharpening by proposing large-scale simulated pretraining to learn robust spatial-spectral priors. It constructs diverse multispectral/panchromatic (MS/PAN) training pairs from ImageNet and SkyScript using stochastic degradation and augmentation, enabling zero-shot and one-shot adaptation across multiple satellite sensors. Across CNN, Transformer, and Mamba architectures, pretraining (especially on SkyScript) followed by full-tuning yields strong cross-domain performance, supporting the benefit of simulated priors for fusion tasks. The approach offers a practical pathway toward foundation-model-like generalization in pansharpening and establishes benchmarks for cross-sensor fusion under limited real data, with potential extensions to broader remote-sensing tasks and hyperspectral data.
Abstract
Existing deep learning methods for remote sensing image fusion often generalize poorly to unseen datasets because real training data is scarce and a domain gap separates different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (band generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on this simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) under zero-shot and one-shot paradigms, with both full-tuning and freeze-tuning used for fine-tuning. Extensive experiments on different network architectures, including convolutional neural networks, Transformers, and Mamba, demonstrate that our pretraining strategy significantly improves generalization across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and adapt well with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
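
To make the data-simulation step concrete, the following is a minimal sketch of how one training triple (low-resolution MS input, PAN input, high-resolution MS target) might be derived from a single source image. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the sigma and noise ranges, the scale factor of 4, and the use of a random convex band combination as the PAN image are all hypothetical choices. Only the general ingredients (blur, noise, downsampling, channel shuffling) come from the abstract.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with a radius of about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) array with reflect padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Filter along the vertical axis, then along the horizontal axis.
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)


def simulate_triple(hr_ms, scale=4, rng=None):
    """Build (lr_ms, pan, target) from an (H, W, C) image with values in [0, 1].

    All parameter ranges below are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_bands = hr_ms.shape[-1]

    # Augmentation: shuffle channels so band order carries no fixed meaning.
    target = hr_ms[..., rng.permutation(n_bands)]

    # Assumed PAN model: a random convex combination of the bands.
    weights = rng.dirichlet(np.ones(n_bands))
    pan = target @ weights

    # Degradation: random blur, decimation by `scale`, additive Gaussian noise.
    sigma = rng.uniform(0.8, 2.0)
    lr_ms = blur(target, sigma)[::scale, ::scale]
    lr_ms = lr_ms + rng.normal(0.0, rng.uniform(0.0, 0.02), lr_ms.shape)
    lr_ms = np.clip(lr_ms, 0.0, 1.0)

    return lr_ms, pan, target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 4))  # stand-in for a real 4-band source image
    lr_ms, pan, target = simulate_triple(image, scale=4, rng=rng)
    print(lr_ms.shape, pan.shape, target.shape)
```

Because the degradation parameters are resampled on every call, repeated calls on the same source image yield different input pairs, which is one plausible way to obtain the diversity the abstract describes.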
