Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang

TL;DR

The paper addresses cloud removal (CR) in remote sensing by introducing EMRDM, an improved mean-reverting diffusion model that starts diffusion from cloudy inputs via a forward stochastic differential equation (SDE) and reconstructs cloudless images with an ODE-based backward process. It offers a modular, elucidated design space by reformulating the forward process and redefining the denoiser through a preconditioning framework, enabling independent module improvements and compatibility with generative diffusion methods. A novel multi-temporal denoising network denoises sequential cloudy images in parallel using temporal fusion attention, enhancing restoration across time. Comprehensive experiments on mono-temporal and multi-temporal datasets show EMRDM achieving state-of-the-art performance, validating the framework’s effectiveness and practicality for high-fidelity CR in diverse remote-sensing scenarios. The work provides code for reproducibility and demonstrates strong potential for deployment in real-time CR tasks.

Abstract

Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DMs) exhibit strong generative capabilities, applying them directly to CR is suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model, EMRDM, based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at https://github.com/Ly403/EMRDM.
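
The contrast drawn in the abstract, between sampling from random noise and diffusing directly between cloudy and cloudless images, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the noise level `sigma_max`, and the simple additive-Gaussian form of the starting point are assumptions made only for illustration.

```python
import numpy as np


def generative_start(shape, sigma_max, rng):
    """Generative DM: the backward process starts from pure Gaussian noise."""
    return sigma_max * rng.standard_normal(shape)


def mean_reverting_start(cloudy, sigma_max, rng):
    """Mean-reverting DM: the backward process starts from the cloudy image
    perturbed by noise, so information in the cloudy input is retained.
    The additive form used here is an assumption, not the paper's equation."""
    return cloudy + sigma_max * rng.standard_normal(cloudy.shape)


rng = np.random.default_rng(0)
cloudy = np.full((4, 64, 64), 0.3)  # hypothetical 4-band cloudy patch
x_gen = generative_start(cloudy.shape, 0.1, rng)
x_mr = mean_reverting_start(cloudy, 0.1, rng)
```

In this toy setting, `x_mr` stays centred on the cloudy input while `x_gen` carries no trace of it, which is the information gap the abstract attributes to generative DMs.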

Paper Structure

This paper contains 32 sections, 53 equations, 12 figures, 6 tables, 3 algorithms.

Figures (12)

  • Figure 1: Comparison of EMRDM (c) with generative DMs (a) and MRDMs (b). Here, target is the cloudless image, pred is the CR prediction result, mean is the cloudy image, and noisy mean is the noisy cloudy image. The forward processes of (a), (b), and (c) generate diffused images approximated by noise (for DMs) and noisy mean (for EMRDM and MRDMs), respectively.
  • Figure 2: (a) The EMRDM framework comprises a forward process and a backward process that contains a denoiser. (b) The denoiser consists primarily of a denoising network, where the preconditioning module generates reparameterized factors $c_{\text{in}}\left(\sigma\right),c_{\text{out}}\left(\sigma\right),c_{\text{skip}}\left(\sigma\right),c_{\text{noise}}\left(\sigma\right)$ based on noise level $\sigma(t)$. We show the multi-temporal condition with the sequence length $L$.
  • Figure 3: Illustration of the denoising network. (a) The network concurrently denoises sequences of noisy cloudy images (noisy mean), cloudy images (mean), and optional auxiliary modal images (aux) to generate results (pred). The notation $\times L$ indicates $L$ weight-sharing copies. (b) We extend the original HDiT Blocks to THDiT Blocks to integrate temporal information. (c) TFSA collapses the temporal dimension of inputs and generates the attention masks. For simplicity, we present a single-head scenario. Feature map dimensions are indicated below each block, where $N$ is the batch size, $H$ is the height, $W$ is the width, $L$ is the sequence length, $C$ is the number of feature-map channels, $G$ is the number of heads, $d_c$ is the number of condition-vector channels, and $d_k$ is the number of channels of the query and key matrices.
  • Figure 4: (a) SEN12MS-CR dataset results: RGB channels for optical imagery (linearly enhanced for visualization) and VV channel for SAR imagery. GLF-CR results are obtained by combining four separately processed subimages as it processes $128\times128$ images ($256\times256$ for others). (b) Sen2_MTC_New dataset results. (c,d) RGB channel results on CUHK-CR1 and CUHK-CR2 datasets, respectively.
  • Figure 5: Analysis of our samplers on the Sen2_MTC_New dataset. When $S_\text{churn}=0$, the sampler becomes deterministic. The upper row shows the effects of $S_\text{churn}$ and $S_\text{noise}$ with $S_\text{tmin}=0$ and $S_\text{tmax}\ge100$ fixed. The lower row examines the effects of $S_\text{tmin}$ and $S_\text{tmax}$ with fixed $S_\text{churn}=1$ and $S_\text{noise}=1$. Note that $S_\text{tmin}\ge S_\text{tmax}$ is excluded, as this leads to a deterministic sampler.
  • ...and 7 more figures
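
The preconditioning factors named in the Figure 2 caption and the sampler parameters in the Figure 5 caption share their names with those of the EDM framework (Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models"). The sketch below uses the standard EDM expressions as a stand-in; the exact forms used by EMRDM are not given in this summary and may differ. The value `SIGMA_DATA = 0.5` and all function names are assumptions.

```python
import math

import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation (EDM default), not from this paper


def edm_preconditioning(sigma, sigma_data=SIGMA_DATA):
    """Return (c_in, c_out, c_skip, c_noise) using the EDM formulas (assumption)."""
    total = sigma ** 2 + sigma_data ** 2
    c_in = 1.0 / math.sqrt(total)
    c_out = sigma * sigma_data / math.sqrt(total)
    c_skip = sigma_data ** 2 / total
    c_noise = math.log(sigma) / 4.0
    return c_in, c_out, c_skip, c_noise


def denoise(network, x_noisy, sigma):
    """Preconditioned denoiser: c_skip * x + c_out * F(c_in * x, c_noise)."""
    c_in, c_out, c_skip, c_noise = edm_preconditioning(sigma)
    return c_skip * x_noisy + c_out * network(c_in * x_noisy, c_noise)


def churn_gamma(sigma, num_steps, s_churn, s_tmin, s_tmax):
    """Per-step noise-injection factor of the EDM stochastic sampler.

    Returns 0 (a deterministic step) when s_churn is 0 or when sigma lies
    outside the interval [s_tmin, s_tmax].
    """
    if s_churn > 0 and s_tmin <= sigma <= s_tmax:
        return min(s_churn / num_steps, math.sqrt(2.0) - 1.0)
    return 0.0


# Usage with a placeholder network that always predicts zeros.
zero_network = lambda x, c_noise: np.zeros_like(x)
x = np.ones((2, 3))
pred = denoise(zero_network, x, sigma=0.5)
```

Under these assumed formulas, setting `s_churn=0` or choosing `s_tmin` greater than `s_tmax` makes `churn_gamma` return zero at every step, which matches the caption's remark that such settings yield a deterministic sampler.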