Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

Hancong Jin, Zihan Cao, Liang-jian Deng, Jingjing Li

TL;DR

This work tackles pansharpening by combining the strengths of diffusion models with fast CNN-based regression. It introduces KSDiff, which generates diffusion-informed convolutional kernels in latent space using a kernel generator guided by a latent diffusion prior, enabling global context integration without the latency of full-pixel diffusion. A two-stage training protocol is paired with a Pyramid Latent Fusion Encoder to fuse PAN, LRMS, and HRMS priors, and a structure-aware multi-head attention mechanism governs kernel modulation via a low-rank Tucker decomposition. Empirical results on WV3, GF2, and QB show competitive or superior fusion quality with inference speeds orders of magnitude faster than diffusion baselines. The approach generalizes across backbones and datasets, offering a practical, scalable solution for remote-sensing image fusion with strong spectral–spatial fidelity.

Abstract

Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels by integrating a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, with over $500 \times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations, and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.
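To make the kernel construction more concrete, the following is a minimal sketch of how a convolution kernel can be rebuilt from a low-rank Tucker core and per-mode factor matrices. It is not the authors' implementation: the function name `tucker_kernel`, the channel counts, the kernel size, and the ranks are all assumptions chosen for illustration, and the random arrays merely stand in for the outputs of the core generator and the factor generator.

```python
import numpy as np

def tucker_kernel(core, u_out, u_in, u_h, u_w):
    """Rebuild a 4-D conv kernel (out, in, height, width) from a Tucker core.

    core has shape (r_out, r_in, r_h, r_w); each factor matrix maps one
    rank dimension back to the full size of the corresponding kernel mode.
    """
    return np.einsum("abcd,oa,ib,hc,wd->oihw", core, u_out, u_in, u_h, u_w)

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
c_out, c_in, k = 32, 32, 3
ranks = (4, 4, 2, 2)

# Stand-ins for the generated core tensor and factor matrices.
core = rng.standard_normal(ranks)
u_out = rng.standard_normal((c_out, ranks[0]))
u_in = rng.standard_normal((c_in, ranks[1]))
u_h = rng.standard_normal((k, ranks[2]))
u_w = rng.standard_normal((k, ranks[3]))

kernel = tucker_kernel(core, u_out, u_in, u_h, u_w)

# Parameter counts: full kernel versus its low-rank parameterization.
full_params = kernel.size
low_rank_params = core.size + u_out.size + u_in.size + u_h.size + u_w.size
```

With these illustrative sizes, the full kernel holds 32 × 32 × 3 × 3 = 9216 values, while the core and factors together hold 332, which is why generating the low-rank pieces is far cheaper than generating the kernel directly.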

Paper Structure

This paper contains 37 sections, 10 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: (a) Traditional DL-based methods directly learn a non-linear mapping $f_{\theta}$ to fuse PAN and LRMS images in a one-step manner. (b) Recent diffusion-based methods employ a multi-step refinement process in the pixel space, conditioned on PAN and LRMS, starting from pure Gaussian noise $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Here $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$, $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{c})$, and $\mathbf{c}$ denote the noise-adding process, the reverse denoising process, and the condition, respectively. (c) The proposed KSDiff generates convolution kernels to enhance regression-based pansharpening networks via a diffusion model operating in the latent space. This design enables high-quality reconstruction with fast inference.
  • Figure 2: Kernel Generator of our proposed KSDiff. The kernel generator comprises two sub-modules: (1) a diffusion model-driven convolutional core generator, (2) a unified factor generator that takes feature maps as input. The outputs of these two modules are integrated using a structure-aware multi-head attention mechanism to reparameterize the base kernel. Note that in the pre-training stage, the latent representation $\mathbf{z}$ is the output of an encoder; in the diffusion model training stage and at inference time, this representation is generated by the diffusion model, namely $\hat{\mathbf{z}}_{0}$.
  • Figure 3: Pyramid Latent Fusion Encoder (PLFE). The figure only shows the structure of $\mathrm{PLFE}_1$; since $\mathrm{PLFE}_2$ only takes PAN and LRMS as input, its structure is slightly different (a halved version compared to $\mathrm{PLFE}_1$). Further details can be found in the supplementary materials.
  • Figure 4: An overview of our two-stage training procedure and inference process. (a) Pre-training Stage: The goal is to extract a latent representation $\mathbf{z}$ by optimizing $\mathrm{PLFE}_1$ jointly with the kernel generator and the pansharpening network. (b) Diffusion Model Training Stage: The latent representation $\mathbf{z}_0$, extracted by the pre-trained $\mathrm{PLFE}_1$, is predicted by leveraging the strong distribution estimation ability of the diffusion model (no ground truth is used as input for its denoising network). (c) Inference: Only $\mathrm{PLFE}_2$, the reverse diffusion process, the kernel generator, and the pansharpening network are involved.
  • Figure 5: Comparison of qualitative results for representative methods on the GF2 reduced-resolution dataset. The first row displays RGB outputs, and the second row presents the error maps.
  • ...and 9 more figures
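
The noise-adding process $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ named in the Figure 1 caption can be illustrated with the standard DDPM closed form applied to a latent vector. This is a sketch under stated assumptions, not the paper's configuration: the linear beta schedule, the number of steps, the latent size, and the helper names `make_alpha_bar`, `q_sample`, and `predict_z0` are all hypothetical, and the true noise is used in place of a trained denoising network.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Jump from z_0 to z_t in one step: sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def predict_z0(zt, eps_hat, t, alpha_bar):
    """Invert the closed form given a noise estimate eps_hat."""
    return (zt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=8)   # hypothetical step count
z0 = rng.standard_normal(16)              # hypothetical latent size

# Noise the latent at the last step, then recover it with the true noise,
# which is what a perfect denoising network would output.
zt, eps = q_sample(z0, t=7, alpha_bar=alpha_bar, rng=rng)
z0_hat = predict_z0(zt, eps, t=7, alpha_bar=alpha_bar)
```

Because the process runs on a compact latent vector rather than a full-resolution image, each denoising step touches far fewer values, which is consistent with the speed advantage the captions attribute to latent-space diffusion.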