SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening

Junjie Li, Congyang Ou, Haokui Zhang, Guoting Wei, Shengqin Jiang, Ying Li, Chunhua Shen

TL;DR

SALAD-Pan addresses cross-sensor pansharpening by performing diffusion in a latent space learned with a band-wise single-channel VAE, which makes processing sensor-agnostic across varying MS band configurations. It couples PAN-driven spatial guidance and upsampled LRMS-driven spectral guidance through bidirectional encoder interactions and frequency-split fusion, augmented with sensor-aware text prompts and a lightweight cross-band attention module. The method delivers state-of-the-art results on the PanCollection sensors GF2, QB, and WV3, with 2–3× faster inference and robust zero-shot transfer to WV2. These contributions indicate that latent-space diffusion, together with disentangled conditioning and cross-band coherence, provides a practical, scalable solution for high-fidelity pansharpening in multi-sensor remote sensing pipelines.

Abstract

Diffusion models have recently brought new insights to pansharpening and notably improved fusion precision. However, most existing models perform diffusion in pixel space and train a separate model for each type of multispectral (MS) imagery, and therefore suffer from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent-space diffusion method for efficient pansharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) images into compact latent representations, which supports MS images with varying channel counts and establishes a basis for acceleration. Spectral physical properties, along with PAN and MS images, are then injected into the diffusion backbone through unidirectional and bidirectional interactive control structures, respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module is added to the central layer of the diffusion model, reinforcing spectral connections to improve spectral consistency and further raise fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2–3× inference speedup, and exhibits robust zero-shot (cross-sensor) capability.
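
To illustrate why band-wise single-channel encoding supports varying channel counts, the following is a minimal sketch, not the authors' implementation. The function `encode_band` is a hypothetical stand-in for the single-channel VAE encoder (here plain average pooling), and the downsampling factor of 4 is an assumption made only for illustration. Because each band is encoded independently by the same single-channel function, the same code handles a 4-band and an 8-band image without modification.

```python
import numpy as np

def encode_band(band: np.ndarray, factor: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a single-channel VAE encoder.

    Average-pools one band of shape (H, W) by `factor` (an assumed value);
    H and W must be divisible by `factor`.
    """
    h, w = band.shape
    return band.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def encode_bandwise(ms: np.ndarray, factor: int = 4) -> np.ndarray:
    """Encode an MS image of shape (C, H, W) one band at a time.

    The encoder never sees the channel count C, so any C is accepted.
    """
    return np.stack([encode_band(band, factor) for band in ms])

rng = np.random.default_rng(0)
four_band = rng.random((4, 64, 64))   # synthetic 4-band MS image
eight_band = rng.random((8, 64, 64))  # synthetic 8-band MS image

z4 = encode_bandwise(four_band)
z8 = encode_bandwise(eight_band)
print(z4.shape, z8.shape)  # (4, 16, 16) (8, 16, 16)
```

A multi-channel encoder would instead fix the input channel count in its first layer, which is one reason separate models are commonly trained per sensor.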

Paper Structure

This paper contains 62 sections, 60 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overview of SALAD-Pan. Stage I trains a band-wise single-channel VAE to map each HRMS band into a compact latent space. Stage II performs band-wise conditional latent diffusion with disentangled spatial--spectral conditioning: a spatial branch encodes PAN for spatial guidance, while a spectral branch encodes the upsampled LRMS band-by-band for spectral guidance. We use hybrid coupling: bidirectional interaction in the encoder and unidirectional (branch$\rightarrow$backbone) control in the mid block and decoder. RCBA improves inter-band consistency, and sensor-aware metadata prompts from a frozen CLIP text encoder provide additional conditioning.
  • Figure 2: At each resolution, a PAN-driven spatial control branch and an LRMS-driven spectral control branch couple with the main trunk via GLU zero-convolution residual adapters: bidirectionally in the encoder only, and branch$\rightarrow$trunk in the mid block and decoder. Residuals are fused by frequency-split injection (see the section on hybrid coupling and frequency-split injection).
  • Figure 3: Visual comparison on the WorldView-3 (WV-3) and QuickBird (QB) datasets at reduced resolution (RR).
  • Figure 4: Visual comparison on the WorldView-3 (WV-3) and QuickBird (QB) datasets at full resolution (FR).
  • Figure 5: Sensor-dependent PAN--MS mixing on PanCollection. Heatmap of ridge-regression coefficients $\{w_b^{(S)}\}$ in the PAN-mixing surrogate equation (Appendix B), fitted from aligned $\tilde{M}_b$ to PAN $P$. Rows denote sensors and columns denote MS bands $b\in\{C,B,G,Y,R,RE,N1,N2\}$. Blank entries indicate bands unavailable for that sensor. Coefficients are normalized within each sensor for visualization, and the cross-row differences reveal substantial sensor dependence in the effective PAN--MS coupling.
  • ...and 7 more figures
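
The ridge-regression fit described in the Figure 5 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's code. It assumes a linear surrogate without intercept, $P \approx \sum_b w_b \tilde{M}_b$, a hypothetical regularization strength `lam`, and L1 normalization of the coefficients for display. The caption specifies neither the exact form of the surrogate nor the normalization, so both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def fit_pan_mixing(ms_aligned: np.ndarray, pan: np.ndarray, lam: float = 1e-6) -> np.ndarray:
    """Ridge-regress PAN on aligned MS bands.

    ms_aligned: (C, H, W) MS bands resampled to the PAN grid.
    pan:        (H, W) panchromatic image.
    lam:        assumed ridge strength (not taken from the paper).
    Returns one weight per band, shape (C,).
    """
    c = ms_aligned.shape[0]
    x = ms_aligned.reshape(c, -1).T          # (N, C) design matrix, one row per pixel
    y = pan.reshape(-1)                      # (N,) regression targets
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(x.T @ x + lam * np.eye(c), x.T @ y)

def normalize_for_display(w: np.ndarray) -> np.ndarray:
    """Assumed per-sensor normalization: scale so absolute values sum to 1."""
    return w / np.abs(w).sum()

# Synthetic check: build a PAN image from known band weights and recover them.
rng = np.random.default_rng(0)
true_w = np.array([0.1, 0.3, 0.4, 0.2])
ms = rng.random((4, 32, 32))
pan = np.tensordot(true_w, ms, axes=1)       # (H, W) noise-free linear mixture

w_hat = fit_pan_mixing(ms, pan)
w_vis = normalize_for_display(w_hat)
```

Repeating the fit per sensor and stacking the normalized weights row by row would give a heatmap of the kind the caption describes, with blank entries for bands a sensor lacks.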