Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation

Chenxing Meng, Wuzhou Quan, Yingjie Cai, Liqun Cao, Liyan Zhang, Mingqiang Wei

Abstract

Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56\% and 0.88\% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7\% and accelerating inference by 1.98$\times$. Our implementation is available at https://github.com/mengcx0209/EDC.

Paper Structure (32 sections, 20 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Efficiency–accuracy comparison on M3M-CR. Bubble size indicates model parameter scale. The annotation beside each bubble gives the triplet (throughput, mIoU, parameters). EDC lies on the favorable Pareto frontier, achieving the best mIoU and the highest inference throughput simultaneously.
  • Figure 2: Overall framework of the proposed EDC. The StudentNet takes the SAR image $I_{\text{SAR}}$ and the cloudy optical image $I_{\text{Cloudy}}$ as input. At each stage $i$, the encoder produces an optical-corrected feature $Opt_i$ and a fused semantic feature $CM_i^{\text{Fuse}}$ at the corresponding resolution. Note that $CM_1^{\text{Fuse}}$ is initialized as an empty input (i.e., no prior fused feature is available at the first stage). The cloud-removal (CR) decoder aggregates $\{Opt_i\}_{i=1}^{4}$ to reconstruct the cloud-removed optical image, while the semantic-segmentation (SS) decoder aggregates $\{CM_i^{\text{Fuse}}\}_{i=1}^{4}$ to predict the land-cover map. During training, a TeacherNet with the same architecture is fed with $I_{\text{SAR}}$ and the cloud-free optical image $I_{\text{Cloudfree}}$ and provides supervision to the StudentNet via a cloud-mask-guided distillation loss.
  • Figure 3: Framework of the proposed Efficiency-Oriented Multi-Scale Encoder.
  • Figure 4: Architecture of the proposed DCHF module. It uses a discrepancy-guided attention map $\mathbf{A}$ to perform weighted global average pooling (GAP) for robust channel recalibration and fusion, producing $CM_i^{\text{Fuse}}$ and refined $\{Opt_i,SAR_i\}$ from $\{Opt_{i-1},SAR_{i-1},CM_{i-1}^{\text{Fuse}}\}$ (with $CM_{1}^{\text{Fuse}}$ empty). $\oplus/\ominus/\otimes$: element-wise ops; $\mathrm{C}$: concatenation.
  • Figure 5: Visualization of land cover mapping for four scenes. The first two scenes are from the M3M-CR dataset and the last two are from the WHU-OPT-SAR dataset. For each scene, the panels show, from top-left to bottom-right: the cloudy image, the cloud-free image, the SAR image, the ground truth, and the results of AMM-FuseNet, MCANet, DCSA-Net, FTransUNet, CMX, CloudSeg, and our EDC.
  • ...and 2 more figures
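
The Figure 4 caption describes weighted GAP driven by a discrepancy-guided attention map. The sketch below illustrates that general idea only; it is not the authors' DCHF implementation. Every specific in it is an assumption made for illustration: the function names, the mean absolute optical-SAR difference used as the discrepancy, the exponential weighting with temperature `tau`, and the sigmoid channel gate. Feature maps are plain nested lists shaped `[channel][position]` so the example needs only the standard library.

```python
import math


def discrepancy_attention(opt, sar, tau=1.0):
    """Spatial weights that shrink where optical and SAR features disagree.

    Assumption: discrepancy is the channel-averaged absolute difference,
    and weights are exp(-discrepancy / tau), normalized to sum to 1.
    """
    n_channels = len(opt)
    n_positions = len(opt[0])
    disc = [
        sum(abs(opt[c][n] - sar[c][n]) for c in range(n_channels)) / n_channels
        for n in range(n_positions)
    ]
    raw = [math.exp(-d / tau) for d in disc]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_gap(feat, attn):
    """Attention-weighted global average pooling: one descriptor per channel."""
    return [sum(a * v for a, v in zip(attn, channel)) for channel in feat]


def plain_gap(feat):
    """Ordinary global average pooling, shown for comparison."""
    return [sum(channel) / len(channel) for channel in feat]


def recalibrate(feat, descriptor):
    """Scale each channel by a sigmoid gate of its pooled descriptor (assumed gate)."""
    gates = [1.0 / (1.0 + math.exp(-z)) for z in descriptor]
    return [[g * v for v in channel] for g, channel in zip(gates, feat)]


# Hypothetical toy input: one channel, four positions; the last optical
# value stands in for a cloud-corrupted response.
opt_feat = [[1.0, 1.0, 1.0, 9.0]]
sar_feat = [[1.0, 1.0, 1.0, 1.0]]

attn = discrepancy_attention(opt_feat, sar_feat)
robust_descriptor = weighted_gap(opt_feat, attn)
naive_descriptor = plain_gap(opt_feat)
refined_opt = recalibrate(opt_feat, robust_descriptor)
```

In this toy input the corrupted position receives a near-zero weight, so `robust_descriptor` stays close to the clean value while `naive_descriptor` is pulled toward the outlier. Subtracting optical and SAR features directly presumes they already share a feature space, which is a further simplification of this sketch.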