
Semantic Guided Large Scale Factor Remote Sensing Image Super-resolution with Generative Diffusion Prior

Ce Wang, Wanjie Sun

TL;DR

This work tackles the problem of recovering high-resolution remote sensing imagery from severely downsampled inputs across different sensors. It introduces the Semantic Guided Diffusion Model (SGDM), which uses a latent diffusion process guided by content from vector maps and by style from HR guidance images or style sampling to produce semantically accurate and texturally rich SR images. A style-correcting component (SCM) and a style-distribution model (SFlow) enable diverse outputs and robustness to sensor-induced style gaps, while a new dataset, CMSRD, provides synthetic and real-world remote sensing image pairs for evaluation. The results show that SGDM and its SCM variant outperform competing methods in perceptual quality and on downstream vision tasks, demonstrating practical value for large scale factor remote sensing SR and for downstream analyses such as scene recognition and semantic segmentation.

Abstract

Remote sensing images captured by different platforms exhibit significant disparities in spatial resolution. Large scale factor super-resolution (SR) algorithms are vital for maximizing the utilization of low-resolution (LR) satellite data captured from orbit. However, existing methods struggle to recover SR images with clear textures and correct ground objects. We introduce a novel framework, the Semantic Guided Diffusion Model (SGDM), designed for large scale factor remote sensing image super-resolution. The framework exploits a pre-trained generative model as a prior to generate perceptually plausible SR images. We further enhance the reconstruction by incorporating vector maps, which carry structural and semantic cues. Moreover, pixel-level inconsistencies in paired remote sensing images, stemming from sensor-specific imaging characteristics, may hinder the convergence of the model and limit the diversity of the generated results. To address this problem, we propose to extract the sensor-specific imaging characteristics and model their distribution, allowing diverse SR image generation based on imaging characteristics provided by reference images or sampled from the imaging characteristic probability distributions. To validate and evaluate our approach, we create the Cross-Modal Super-Resolution Dataset (CMSRD). Qualitative and quantitative experiments on CMSRD showcase the superiority and broad applicability of our method. Experimental results on downstream vision tasks also demonstrate the utility of the generated SR images. The dataset and code will be publicly available at https://github.com/wwangcece/SGDM

Paper Structure (32 sections, 10 equations, 17 figures, 3 tables)

Figures (17)

  • Figure 1: In real-world scenarios, there exists a significant spatial resolution disparity among remote-sensing images captured by different sensors. Vector maps, on the other hand, provide rich semantic guidance for large scale factor super-resolution.
  • Figure 2: Main idea of our work. Unlike traditional super-resolution methods that only use LR, our approach additionally incorporates content conditions and style conditions. Content guidance can be provided by vector maps, while style guidance can be achieved through style-guided images or style sampling from HR style space.
  • Figure 3: The framework of SGDM. A VAE is used to shift the diffusion process and reverse process from the pixel space to the latent space. During training, the latent features $z_{0}$ of HR are transformed into $z_{t}$ through the diffusion process, which is then denoised through a U-Net network. To guide the denoising process, we design two modules: the content-style encoder (CS-Encoder) and the adapter. The former integrates information from LR, vector maps, and style guidance images to generate conditional features $F_{\mathrm{cond}}$. The latter generates multi-scale features based on $F_{\mathrm{cond}}$ and performs element-wise addition with the output of the corresponding layers in the U-Net.
  • Figure 4: Detailed structure of the proposed Content-Style Encoder (CS-Encoder). It can utilize the content information from the vector map at multiple scales through the SPADE module, and achieve style injection through the AdaIN module, which results in a conditional feature $F_{\mathrm{cond}}$. This feature encapsulates semantic information and style attributes.
  • Figure 5: Structures of SPADE block, basic block and downsample block in our work.
  • ...and 12 more figures
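
The captions of Figures 3 and 4 name three operations: noising the HR latent $z_{0}$ into $z_{t}$, injecting style through AdaIN, and injecting vector-map content through SPADE. The following is a minimal NumPy sketch of those operations, not the authors' implementation. It assumes the standard DDPM forward process and a linear noise schedule, neither of which is stated in the captions. All function names are hypothetical, and the SPADE scale and shift maps are passed in directly, whereas a real SPADE block would predict them from the vector map with learned convolutions.

```python
import numpy as np

def forward_diffuse(z0, t, alphas_cumprod, rng):
    """Sample z_t from q(z_t | z_0), assuming the standard DDPM forward process."""
    abar = alphas_cumprod[t]
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * noise
    return zt, noise

def adain(content, style, eps=1e-5):
    """AdaIN on (C, H, W) arrays: give `content` the per-channel mean/std of `style`."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

def spade_modulate(feat, gamma, beta, eps=1e-5):
    """SPADE-style modulation: normalise per channel, then apply spatially
    varying scale and shift maps. Using (1 + gamma) is a convention chosen
    here so that gamma = 0 leaves the normalised features unscaled."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return (feat - mean) / std * (1.0 + gamma) + beta

# Toy data: a 4-channel 8x8 "latent" and a style feature map (both random).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
style = 3.0 * rng.standard_normal((4, 8, 8)) + 2.0

# Hypothetical linear beta schedule with 1000 steps (an assumption).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

zt, noise = forward_diffuse(z0, 500, alphas_cumprod, rng)
styled = adain(z0, style)

# With zero scale/shift maps, SPADE reduces to plain per-channel normalisation.
zeros = np.zeros_like(z0)
normed = spade_modulate(z0, zeros, zeros)
```

In this sketch the styled features take on the per-channel statistics of the style map while keeping the spatial layout of the content, which is the role the Figure 4 caption gives to AdaIN. The adapter step described in Figure 3 would then amount to an element-wise sum of such conditional features with the matching U-Net layer outputs.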