DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models

Muhammad Sarmad, Arnt-Børre Salberg, Michael Kampffmeyer

TL;DR

DiffFuSR upsamples all Sentinel-2 bands to a common 2.5 m resolution using a two-stage approach: a diffusion-based SR model for the 10 m RGB bands, trained on harmonized cross-domain data, followed by a learned fusion network that upscales the remaining 10/20/60 m bands using the super-resolved RGB as a spatial prior. The method introduces a conditional DDPM with spatial and degradation encoders to enable blind SR and uses a contrastive degradation model to simulate Sentinel-2-like degradations; a self-supervised fusion stage based on the Wald protocol preserves spectral fidelity while injecting high-frequency spatial details. Quantitative and qualitative evaluations on OpenSR-test and multiple datasets show that DiffFuSR outperforms baselines in reflectance fidelity, spectral consistency, and hallucination suppression, while the fusion stage delivers competitive improvements for multi-band upscaling. The work demonstrates a practical, modular path to generating a 12-band, 2.5 m Sentinel-2 product from public data, with implications for detailed land monitoring and open EO research.

Abstract

This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that it outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical and learned pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This work proposes a novel modular framework for Sentinel-2 SR that utilizes harmonized learning with diffusion models and fusion strategies. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.

Paper Structure

This paper contains 58 sections, 21 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overview of the proposed pipeline combining RGB SR and multispectral fusion. The top branch illustrates the training of a denoising diffusion model using synthetic low-resolution RGB images generated through harmonization, blurring, downsampling, and noise addition. The model learns to reconstruct HR RGB images. The bottom branch depicts the fusion process: multispectral Sentinel-2 bands are fused with the super-resolved RGB output via a learned fusion module to produce full-resolution 12-band output.
  • Figure 2: Architecture of the proposed conditional denoising diffusion model for RGB SR. The model learns to reconstruct a HR RGB image $\mathbf{x}_0$ from a noisy latent $\mathbf{x}_T$ via a denoising process modeled as a Markov chain. The input image is first degraded with Gaussian noise and spatially rearranged using a pixel-folding operation. A spatial encoder extracts contextual features $\mathbf{u}$ from the low-resolution input, while a degradation encoder produces a global degradation embedding $\mathbf{v}$. Both $\mathbf{u}$ and $\mathbf{v}$ are used to condition the denoising network at each reverse time step $t$. The sequence $\mathbf{x}_T \rightarrow \dots \rightarrow \mathbf{x}_0$ is denoised through a learned reverse diffusion process. The final output is reshaped using a pixel shuffle operation to recover the super-resolved RGB image.
  • Figure 3: Architecture of the multispectral fusion pipeline. At test time, each resolution group of 10 m, 20 m, and 60 m bands is processed by a dedicated fusion module that receives the native-resolution multispectral inputs along with the RGB image super-resolved to 2.5 m. These modules independently reconstruct each group at 2.5 m GSD. The outputs are then merged to produce a consistent 12-band super-resolved Sentinel-2 image. This process is applied during both training (using synthetic degraded inputs) and inference (using native-resolution bands), with shared architecture and supervision strategy across all groups. The scaling factor for the degraded training input is computed using the Wald protocol. For example, for the 60 m group, we aim to learn the mapping from 60 m to 2.5 m given a 2.5 m SR guidance signal. This corresponds to a 24× scale factor. During training, the 60 m bands are degraded by this factor, producing inputs at 1440 m GSD. The network then reconstructs the degraded signal to 60 m GSD, supervised by the corresponding 60 m ground-truth reference.
  • Figure 4: Architecture of the proposed GLP-inspired (Vivone et al., 2018) fusion module. Each model variant for the 10 m, 20 m, and 60 m branches contains approximately 202k trainable parameters. The symbols denote basic operations: "+" indicates element-wise addition, "$-$" indicates subtraction, "$\times$" indicates element-wise multiplication, and "c" indicates channel-wise concatenation.
  • Figure 5: Qualitative comparison of RGB SR results (vegetated scene). Rows correspond to models trained on harmonized NAIP, unharmonized NAIP, and WorldStrat. Top sub-row shows image outputs: LR input, bicubic downsampled variant, SR result, harmonized SR, and ground truth. Bottom sub-row shows reflectance consistency (L1), spectral consistency (SAD), and distances to omission, hallucination, and improvement spaces. The harmonized NAIP model yields the best visual and quantitative performance in this vegetated context.
  • ...and 6 more figures
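
The Wald-protocol arithmetic in the Figure 3 caption can be illustrated with a minimal Python sketch. Only the 60 m case (24× factor, 1440 m degraded input) is stated in the caption; the 10 m and 20 m values below are an assumption obtained by applying the same rule to those groups. The function and constant names are hypothetical and are not taken from the DiffFuSR code base.

```python
# Illustrative sketch only: names are hypothetical, not the authors' API.
TARGET_GSD_M = 2.5  # GSD of the super-resolved RGB guidance image, in meters


def wald_training_setup(native_gsd_m: float, target_gsd_m: float = TARGET_GSD_M):
    """Return (scale_factor, degraded_input_gsd_m) for one band group.

    At inference the group is upscaled from native_gsd_m to target_gsd_m.
    For training, the native bands are degraded by the same factor so that
    the native-resolution bands can serve as the ground-truth reference.
    """
    scale_factor = native_gsd_m / target_gsd_m
    degraded_input_gsd_m = native_gsd_m * scale_factor
    return scale_factor, degraded_input_gsd_m


if __name__ == "__main__":
    # 60 m is the case given in the caption; 10 m and 20 m are assumed
    # to follow the same rule.
    for native_gsd in (10.0, 20.0, 60.0):
        factor, degraded = wald_training_setup(native_gsd)
        print(f"{native_gsd:g} m group: {factor:g}x factor, "
              f"training input at {degraded:g} m GSD")
```

For the 60 m group this reproduces the figures quoted in the caption: a 24× factor and a 1440 m training input.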