IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images
Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, Liang Lin
TL;DR
Cloud cover poses a persistent challenge for optical remote-sensing analysis, with CNNs often struggling to capture long-range interactions. This work introduces IDF-CR, a two-stage framework that first performs pixel-space cloud removal using a Swin-transformer backbone with a cloudy-attention module, then refines the result in latent space via a diffusion process guided by ControlNet and an Iterative Noise Refinement (INR) module bridging to a VQ-VAE latent space. The latent diffusion stage yields high-fidelity cloud-free outputs and achieves state-of-the-art performance on public datasets such as RICE and WHUS2-CRv, outperforming CNN, GAN, and diffusion baselines in both reference-based and no-reference metrics. By combining strong spatial modeling in pixel space with robust generative refinement in latent space, IDF-CR improves texture detail and overall image quality for downstream remote-sensing tasks.
Abstract
Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) dominate cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) performance in image generation and reconstruction thanks to their strong generative capabilities. Inspired by the rapid development of diffusion models, we present, for the first time, an iterative diffusion process for cloud removal (IDF-CR), which uses these generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is a two-stage model that operates first in pixel space and then in latent space, which enables a deliberate transition from preliminary cloud reduction to fine detail refinement. In the pixel space stage, Pixel-CR processes the cloudy image and yields a coarse cloud removal result, which provides the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms this low-quality cloud removal result into a high-quality clean output. We adapt Stable Diffusion by incorporating ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for the diffusion model to optimize the distribution of the predicted noise, thereby improving the recovery of fine details. Our model outperforms other SOTA methods, including image reconstruction and optical remote-sensing cloud removal approaches, on optical remote-sensing datasets.
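The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name here (`pixel_cr`, `encode`, `decode`, `predict_noise`, `latent_refine`, `idf_cr`) is a hypothetical stand-in, images are flat lists of floats in [0, 1], and the trained networks are replaced by simple arithmetic. The INR module is omitted because the abstract does not specify how it works. The sketch only shows the order of operations: a coarse pixel-space result conditions an iterative latent-space refinement.

```python
"""Toy sketch of a two-stage (pixel space, then latent space) pipeline.

All functions are hypothetical placeholders, not the IDF-CR networks.
"""
import random


def pixel_cr(cloudy):
    """Stage 1 stand-in: produce a coarse cloud-removed image."""
    # Placeholder rule: cap very bright (cloud-like) pixels at 0.8.
    return [min(p, 0.8) for p in cloudy]


def encode(image):
    """Stand-in encoder: map pixels in [0, 1] to latents in [-1, 1]."""
    return [2.0 * p - 1.0 for p in image]


def decode(latent):
    """Stand-in decoder: map latents back to pixels, clamped to [0, 1]."""
    return [min(1.0, max(0.0, (z + 1.0) / 2.0)) for z in latent]


def predict_noise(latent, condition):
    """Stand-in denoiser: treats the offset from the condition as noise.

    In the paper this role is played by a ControlNet-guided diffusion model.
    """
    return [z - c for z, c in zip(latent, condition)]


def latent_refine(coarse, steps=10, seed=0):
    """Stage 2 stand-in: iteratively denoise a random latent, guided by the prior."""
    rng = random.Random(seed)
    condition = encode(coarse)                       # Stage 1 output as the prior
    latent = [rng.gauss(0.0, 1.0) for _ in condition]  # start from Gaussian noise
    for _ in range(steps):
        noise = predict_noise(latent, condition)
        # Remove half of the predicted noise at every step.
        latent = [z - 0.5 * n for z, n in zip(latent, noise)]
    return decode(latent)


def idf_cr(cloudy):
    """Run both stages: coarse removal, then latent-space refinement."""
    coarse = pixel_cr(cloudy)
    return latent_refine(coarse)


if __name__ == "__main__":
    print(idf_cr([0.2, 0.95, 0.5, 1.0]))
```

Because the placeholder denoiser has no learned knowledge, the refined output simply converges toward the coarse prior; in a real system the second stage would add detail that the first stage cannot recover.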
