Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening

Kai Liu, Zeli Lin, Weibo Wang, Linghe Kong, Yulun Zhang

TL;DR

Fose tackles pansharpening by blending a one-step diffusion model with a compact end-to-end network through a four-stage training pipeline. By distilling a multi-step diffusion baseline into a single step and fusing it with an E2E model via a lightweight adaptor, it achieves a 7.42x speedup while attaining state-of-the-art accuracy on WV3, GF2, and QB. Comprehensive experiments and ablations confirm robust gains across reduced- and full-resolution metrics and highlight the value of adaptive convolution, distillation, and fusion strategies. This work demonstrates the practical viability of diffusion priors in efficient, high-fidelity remote sensing image fusion.

Abstract

Pansharpening is an important image fusion task that fuses low-resolution multispectral images (LRMSI) with high-resolution panchromatic images (PAN) to obtain high-resolution multispectral images (HRMSI). Diffusion models (DM) and end-to-end (E2E) models have greatly advanced the frontier of pansharpening. A DM uses multi-step diffusion to accurately estimate the residual between LRMSI and HRMSI. However, the multi-step process is computationally expensive and time-consuming. E2E models, in turn, remain limited by their lack of priors and their simple structure. In this paper, we propose a novel four-stage training strategy to obtain a lightweight network, Fose, which fuses a one-step DM and an E2E model. We perform one-step distillation on an enhanced SOTA DM for pansharpening to compress the inference process from 50 steps to only 1 step. We then fuse the E2E model with the one-step DM using lightweight ensemble blocks. Comprehensive experiments demonstrate the significant improvement of the proposed Fose on three commonly used benchmarks. Moreover, we achieve a 7.42x speedup over the baseline DM while delivering much better performance. The code and model are released at https://github.com/Kai-Liu001/Fose.
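
The residual formulation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the common convention that the HRMSI estimate equals the upsampled LRMSI plus a predicted residual. The function names, the nearest-neighbor upsampling, the scale factor of 4, and the zero residual standing in for a network output are all illustrative assumptions, not details taken from the paper or its released code.

```python
import numpy as np

def upsample_nearest(lrms: np.ndarray, scale: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, h, w) array to (C, h*scale, w*scale)."""
    return np.repeat(np.repeat(lrms, scale, axis=1), scale, axis=2)

def reconstruct_hrms(lrms: np.ndarray, residual: np.ndarray, scale: int = 4) -> np.ndarray:
    """Add a predicted residual to the upsampled LRMSI (assumed convention)."""
    upsampled = upsample_nearest(lrms, scale)
    if upsampled.shape != residual.shape:
        raise ValueError(
            f"residual shape {residual.shape} does not match upsampled shape {upsampled.shape}"
        )
    return upsampled + residual

# Toy inputs: 8 spectral bands, 16x16 low-resolution grid (hypothetical sizes).
rng = np.random.default_rng(0)
lrms = rng.random((8, 16, 16))

# Placeholder for a model prediction; a real DM or E2E network would supply this.
residual = np.zeros((8, 64, 64))

hrms = reconstruct_hrms(lrms, residual)
print(hrms.shape)
```

Under this convention the network only has to model the high-frequency detail missing from the upsampled input, which is the quantity the abstract says the diffusion model estimates.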

Paper Structure

This paper contains 15 sections, 11 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Performance comparison with SOTA methods on WV3. Fose achieves robust performance across all metrics.
  • Figure 2: Architecture of the proposed Fose. Fose consists of three components: the one-step diffusion (OSD) model, the E2E model, and an ensemble connector. The E2E model adopts an ODE proximal network structure, which achieves excellent performance at a relatively small size. The OSD model uses two branches to process the MSI and PAN separately and fuses them with APFM, an attention-like structure. The ensemble connector uses lightweight convolutional blocks to fuse the outputs of the two models into the final output.
  • Figure 3: Visual comparison with SOTA pansharpening methods. We provide both the MSI and the residual between the fused image and the GT for better visualization. The average residual of Fose is visually smaller than that of other methods, supporting the effectiveness of Fose.
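
The Figure 2 caption describes an ensemble connector that fuses the outputs of the two models with lightweight convolutional blocks. The sketch below shows one simple way such a fusion could work: a 1x1 convolution over the concatenated outputs produces a per-pixel gate that blends the two predictions. The gating design, the function names `conv1x1` and `fuse_outputs`, and all shapes are hypothetical choices made for illustration; the caption does not specify the connector's actual layers.

```python
import numpy as np

def conv1x1(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fuse_outputs(osd_out: np.ndarray, e2e_out: np.ndarray,
                 weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Blend two (C, H, W) predictions with a learned per-pixel gate (assumed design)."""
    stacked = np.concatenate([osd_out, e2e_out], axis=0)           # (2C, H, W)
    gate = 1.0 / (1.0 + np.exp(-conv1x1(stacked, weight, bias)))   # sigmoid, values in (0, 1)
    return gate * osd_out + (1.0 - gate) * e2e_out

# Toy predictions from the two branches (random stand-ins, not real model outputs).
rng = np.random.default_rng(0)
channels, height, width = 8, 64, 64
osd_out = rng.random((channels, height, width))
e2e_out = rng.random((channels, height, width))

# Zero-initialized parameters give a gate of 0.5 everywhere, i.e. a plain average.
weight = np.zeros((channels, 2 * channels))
bias = np.zeros(channels)

fused = fuse_outputs(osd_out, e2e_out, weight, bias)
print(fused.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the two inputs, so in this sketch the connector can only interpolate between the branches rather than overshoot either one.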