Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal

Yuyang Hu, Suhas Lohit, Ulugbek S. Kamilov, Tim K. Marks

TL;DR

This work addresses cloud removal in optical satellite imagery by formulating it as a diffusion-bridge problem conditioned on aligned SAR data. The core approach, DB-CR, introduces a multimodal diffusion bridge with a two-branch SAR–optical backbone and cross-modal attention to fuse structural SAR information with spectral optical details, enabling stable, high-fidelity restoration. Experimental results on SEN12MS-CR demonstrate state-of-the-art distortion and perceptual quality with competitive computational efficiency, and ablations highlight the importance of the diffusion-bridge training and fusion components. The proposed diffusion-bridge framework and multimodal fusion strategy offer a practical, robust solution for cloud removal with controllable inference and strong potential for deployment in remote sensing workflows.

Abstract

Deep learning has achieved some success in removing clouds from optical satellite images by fusing them with synthetic aperture radar (SAR) images. Recently, diffusion models have emerged as powerful tools for cloud removal, delivering higher-quality estimates than earlier methods by sampling from the cloud-free image distribution. However, diffusion models initiate sampling from pure Gaussian noise, which complicates the sampling trajectory and results in suboptimal performance. In addition, current methods fall short in effectively fusing SAR and optical data. To address these limitations, we propose Diffusion Bridges for Cloud Removal (DB-CR), which directly bridges between the cloudy and cloud-free image distributions. We also propose a novel multimodal diffusion bridge architecture with a two-branch backbone for multimodal image restoration, incorporating an efficient backbone and dedicated cross-modality fusion blocks to effectively extract and fuse features from SAR and optical images. By formulating cloud removal as a diffusion-bridge problem and leveraging this tailored architecture, DB-CR achieves high-fidelity results while being computationally efficient. We evaluate DB-CR on the SEN12MS-CR cloud-removal dataset and show that it achieves state-of-the-art results.

Paper Structure

This paper contains 23 sections, 18 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Illustration of DB-CR's iterative process: using SAR data and a trained diffusion bridge, DB-CR progressively refines a cloud-covered image, reducing cloud cover step by step to produce a clear, cloud-free result.
  • Figure 2: A schematic diagram of our method, illustrating the inference process. Each iteration alternates between a reverse diffusion-bridge step, which estimates the cloud-free state, and a forward diffusion-bridge step, which progresses the state to the next diffusion timestep. The process starts from the cloudy image, progressively refining it toward a cloud-free output. $T$ is the number of diffusion timesteps used in training the diffusion bridge. During inference, $N = \frac{T}{s}$ diffusion timesteps are used, where $s$ is the step size. ${\bm{z}}$ refers to the SAR image.
  • Figure 3: The architecture of the DB-CR backbone network. (a) This two-branch design combines a U-Net restoration branch (bottom), which utilizes NAFBlocks (purple rectangles) for efficient restoration, and a SAR feature extraction branch (top). The network utilizes SFBlocks (orange rectangles) for feature fusion across branches. (b) Detailed architecture of each SFBlock.
  • Figure 4: $\alpha_t$ scheduling for DB-CR, which follows a sine-based schedule.
  • Figure 5: Comparison of DB-CR vs. baseline methods on two agricultural scenes. For each image in the first and third rows, the detail region enclosed in the red rectangle is magnified in the subsequent row. For each scene, the original cloudy image is shown in the leftmost column, and the reference image (actual cloud-free image of the same area as the cloudy image) is in the rightmost column. Compared to the baseline methods, DB-CR achieves finer detail recovery, accurately preserves edges, and provides sharper details with superior perceptual quality.
  • ...and 2 more figures
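
The alternating inference procedure described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sine-based form of `alpha_schedule`, the deterministic interpolation used for the forward step, and the names `bridge_inference` and `denoiser` are all hypothetical. Only the overall structure comes from the caption: start from the cloudy image, alternate a reverse step (estimate the cloud-free state) with a forward step (move to the next timestep), and use $N = T/s$ steps conditioned on the SAR image $\bm{z}$.

```python
import numpy as np


def alpha_schedule(T):
    """Hypothetical sine-based schedule: alpha is 0 at t=0 and 1 at t=T."""
    t = np.arange(T + 1)
    return np.sin(0.5 * np.pi * t / T)


def bridge_inference(y, z, denoiser, T=1000, s=200):
    """Run N = T / s bridge steps, starting from the cloudy image y.

    y        : cloudy optical image (array)
    z        : aligned SAR image (array), passed to the network as conditioning
    denoiser : callable (x_t, z, t) -> estimate of the cloud-free image
               (stands in for the trained backbone; hypothetical signature)
    """
    assert T % s == 0, "step size s must divide T"
    alpha = alpha_schedule(T)
    x_t = y.copy()  # the bridge starts at the cloudy image (t = T)
    for t in range(T, 0, -s):
        # Reverse step: estimate the cloud-free state from the current state.
        x0_hat = denoiser(x_t, z, t)
        # Forward step (assumed form): interpolate between the estimate and
        # the cloudy input at the next, smaller timestep.
        t_next = t - s
        x_t = (1.0 - alpha[t_next]) * x0_hat + alpha[t_next] * y
    return x_t  # at t = 0, alpha is 0, so this equals the last estimate
```

With `T=1000` and `s=200`, the loop calls the network five times, at timesteps 1000, 800, 600, 400, and 200. A larger `s` trades restoration quality for fewer network evaluations, which is one way the inference cost can be controlled.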