Universal Pansharpening Foundation Model

Hebaixu Wang, Jing Zhang, Haonan Guo, Di Wang, Jiayi Ma, Bo Du, Liangpei Zhang

TL;DR

FoundPS introduces a modality-interleaved transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS images into a unified latent space via tensor multiplication. It then constructs a latent diffusion bridge model to progressively evolve the latent representations.

Abstract

Pansharpening generates a high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. Existing methods are predominantly satellite-specific and scene-dependent, which severely limits their generalization across heterogeneous sensors and varied scenes, thereby reducing their real-world practicality. To address these challenges, we present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion. Specifically, we introduce a modality-interleaved transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS images into a unified latent space via tensor multiplication. Building upon this, we construct a latent diffusion bridge model to progressively evolve latent representations, and incorporate bridge posterior sampling to couple latent diffusion with pixel-space observations, enabling stable and controllable fusion. Furthermore, we devise infinite-dimensional pixel-to-latent interaction mechanisms to comprehensively capture the cross-domain dependencies between PAN observations and MS representations, thereby facilitating complementary information fusion. In addition, to support large-scale training and evaluation, we construct a comprehensive pansharpening benchmark, termed PSBench, consisting of worldwide MS and PAN image pairs from multiple satellites across diverse scenes. Extensive experiments demonstrate that FoundPS consistently outperforms state-of-the-art methods, exhibiting superior generalization and robustness across a wide range of pansharpening tasks.
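
The idea of mapping arbitrary-band MS images into one latent space through reversible affine bases can be illustrated with a small NumPy sketch. This is not the FoundPS implementation: in the paper the bases are learned by the modality-interleaved transformer, whereas here they are random orthonormal columns, and the names `make_bases`, `encode`, `decode`, and the latent width of 16 are hypothetical choices made only for illustration.

```python
import numpy as np

def make_bases(num_bands, latent_dim, seed=0):
    """Return a hypothetical affine basis (matrix + offset) for `num_bands` bands.

    Assumption: random orthonormal columns stand in for learned bases.
    Requires latent_dim >= num_bands so the map can be inverted.
    """
    rng = np.random.default_rng(seed)
    # Reduced QR yields a (latent_dim, num_bands) matrix with orthonormal columns.
    basis, _ = np.linalg.qr(rng.standard_normal((latent_dim, num_bands)))
    offset = rng.standard_normal(latent_dim)
    return basis, offset

def encode(ms, basis, offset):
    """Map an MS image (bands, H, W) to a fixed-width latent (latent_dim, H, W)."""
    # Contract the band axis of the image with the band axis of the basis.
    return np.tensordot(basis, ms, axes=([1], [0])) + offset[:, None, None]

def decode(latent, basis, offset):
    """Invert `encode` with the pseudo-inverse of the basis."""
    basis_pinv = np.linalg.pinv(basis)  # shape (bands, latent_dim)
    return np.tensordot(basis_pinv, latent - offset[:, None, None], axes=([1], [0]))

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    for bands in (4, 7, 8, 10):  # band counts mentioned for PSBench
        ms = rng.random((bands, 8, 8))
        basis, offset = make_bases(bands, latent_dim=16)
        latent = encode(ms, basis, offset)
        recon = decode(latent, basis, offset)
        print(bands, latent.shape, np.allclose(recon, ms))
```

Whatever the band count, the latent has the same channel width, so a single downstream model can consume it, and the pseudo-inverse recovers the original bands because the basis has full column rank.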

Paper Structure

This paper contains 29 sections, 26 equations, 24 figures, 12 tables, and 2 algorithms.

Figures (24)

  • Figure 1: Pansharpening performance comparisons on PSBench across representative spectral band configurations (4-, 7-, 8-, and 10-band). The upper and lower panels show the reduced- and full-scale evaluations using PSNR and the non-reference QNR metric, respectively. FoundPS consistently outperforms the advanced task-specific models, offering a universal solution for pansharpening tasks.
  • Figure 2: Mainstream pansharpening paradigms: (a) Satellite-specific methods employ independent encoder–decoder architectures per spectral configuration, resulting in parameter redundancy and poor cross-satellite scalability. (b) Band-truncated methods unify data format by selecting a fixed band subset, enabling joint training but discarding spectral information and losing the ability to process excluded bands. (c) Our universal method performs band-agnostic fusion by projecting arbitrary-band MS images into a shared latent space, achieving unified modeling without band truncation or parameter duplication, while preserving full cross-band and cross-modal information.
  • Figure 3: Satellite observation locations and distribution of PSBench. It contains four representative spectral configurations for pansharpening tasks with over seventeen landcover categories.
  • Figure 4: The overall framework of the proposed FoundPS.
  • Figure 5: The network architecture of Infinite-UNet in latent diffusion bridge model.
  • ...and 19 more figures
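
Figure 1 reports reduced-scale results in PSNR. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition, assuming images scaled to `[0, data_range]` and the error averaged jointly over all bands and pixels; the paper's exact evaluation protocol (for example per-band averaging) may differ. The function name `psnr` and its parameters are illustrative.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Mean squared error over every band and pixel (an assumed convention).
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

if __name__ == "__main__":
    ref = np.zeros((4, 8, 8))          # hypothetical 4-band reference
    est = np.full((4, 8, 8), 0.1)      # constant error of 0.1 everywhere
    print(round(psnr(ref, est), 6))
```

Higher values indicate that the fused image is closer to the reference. QNR, by contrast, needs no reference image and is therefore used for the full-scale evaluation.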