IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images

Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, Liang Lin

TL;DR

Cloud cover poses a persistent challenge for optical remote-sensing analysis, with CNNs often struggling to capture long-range interactions. This work introduces IDF-CR, a two-stage framework that first performs pixel-space cloud removal using a Swin-transformer backbone with a cloudy-attention module, then refines the result in latent space via a diffusion process guided by ControlNet and an Iterative Noise Refinement (INR) module bridging to a VQ-VAE latent space. The latent diffusion stage yields high-fidelity cloud-free outputs and achieves state-of-the-art performance on public datasets such as RICE and WHUS2-CRv, outperforming CNN, GAN, and diffusion baselines in both reference-based and no-reference metrics. By combining strong spatial modeling in pixel space with robust generative refinement in latent space, IDF-CR improves texture detail and overall image quality for downstream remote-sensing tasks.

Abstract

Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) dominate cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) performance in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two stages that address pixel space and latent space, respectively. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal result that provides the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms the low-quality cloud removal result into a high-quality clean output. We refine Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for the diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model performs best compared with other SOTA methods, including image reconstruction and optical remote-sensing cloud removal methods, on optical remote-sensing datasets.

Paper Structure (12 sections, 17 equations, 8 figures, 6 tables, 3 algorithms)


Figures (8)

  • Figure 1: Training and inference pipelines of the proposed component divide-and-conquer cloud removal. It consists of two stages. (Pixel Space): We pretrain a transformer-based cloud removal module (Pixel-CR) to perform coarse elimination of clouds in pixel space. Its output $I_{Decloudy-LQ}$ provides prior cloud removal knowledge to the diffusion model in latent space. (Latent Space): First, the encoder of the VQ-VAE $\varepsilon$ maps the input from pixel space to latent space. Then, the continuous variables are discretized by nearest-distance search in the $CodeBook$. The cloud-free label and the coarse cloud removal information are denoted as $z_0$ and the conditioning variable $C_{latent}$, respectively. The high-quality cloud removal output $I_{Decloudy-HQ}$ is produced by our proposed iterative noise diffusion (IND) module, which consists of ControlNet and iterative noise refinement (INR). ControlNet is a trainable parallel module tasked with learning the data distributions associated with $C_{latent}$ and the true vector $z_t$. INR creates intricate noise patterns to improve the precision of the predicted noise and strengthen model robustness. Finally, $z_0$ is projected back into pixel space by the VQ-VAE decoder $\mathcal{D}$. During inference, the noise $Z_T$ is drawn from a normal distribution $\mathcal{N}(0, I)$. The uppercase $Z$ and lowercase $z$ refer to the inference and training stages, respectively.
  • Figure 2: Graphical representation of Pixel Cloud Removal module (Pixel-CR).
  • Figure 3: Graphical representation of the iterative noise refinement (INR) module. We present an instance of data distribution refinement for INR (upper row) and the visualizations obtained after the corresponding noise sampling (lower row). (Row 1): the curves show the distribution of the data. The blue solid line denotes the true noise, while the red dotted line denotes the representation after the diffusion process of $z_0$. The red solid line represents the noise predicted by UNet $\theta$. Gradient updates move the red solid line toward the blue solid line. (Row 2): visualization of the predicted noise over different iterations. $S(\epsilon)$ denotes the cloud-free sampled result. As the number of iterations increases, both color contrast and texture refinement gradually improve.
  • Figure 4: Pixel space qualitative analysis of the proposed and existing methods, C2PNet [zheng2023curricular], RIDCP [wu2023ridcp], SGID-PFF [bai2022self], Spa-GAN [spa-gan], SwinIR [swinir], and DiffBIR [diffbir], for thin cloud removal performance in different natural environments on the RICE1 dataset [rice].
  • Figure 5: Pixel space qualitative comparison of cloud removal results under different cloud cover on the RICE2 dataset.
  • ...and 3 more figures
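
The Figure 1 caption mentions two latent-space steps that can be illustrated in a few lines: discretizing continuous latents by nearest-distance search in a codebook, and noising the clean latent $z_0$ until it is close to $\mathcal{N}(0, I)$. The following is a minimal NumPy sketch of those two generic operations, not the authors' implementation. The function names `quantize` and `forward_diffuse`, the toy 2-D codebook, the linear beta schedule, and the step count of 1000 are all assumptions chosen for illustration; the standard DDPM forward formula is used because the caption does not give the paper's exact schedule.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared L2 distance).

    latents:  array of shape (N, D)
    codebook: array of shape (K, D)
    Returns the quantized latents (N, D) and the chosen indices (N,).
    """
    # Pairwise squared distances between every latent and every code: (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


def forward_diffuse(z0, t, alpha_bar, rng):
    """Standard DDPM forward step (assumed here, not taken from the paper):
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    """
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


# Toy data: three 2-D codes and three continuous latents (illustrative values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
z0, idx = quantize(latents, codebook)

# Hypothetical linear beta schedule with T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
zt, eps = forward_diffuse(z0, 500, alpha_bar, rng)
```

With this schedule `alpha_bar[-1]` is very small, so the final-step latent is almost entirely noise, which is consistent with the caption's statement that inference starts from $Z_T$ drawn from $\mathcal{N}(0, I)$. A denoising network such as the UNet $\theta$ mentioned in Figure 3 would be trained to predict `eps` from `zt`; that network, the ControlNet conditioning on $C_{latent}$, and the INR module are outside the scope of this sketch.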