MSCloudCAM: Multi-Scale Context Adaptation with Convolutional Cross-Attention for Multispectral Cloud Segmentation
Md Abdullah Al Mazid, Liangdong Deng, Naphtali Rishe
TL;DR
Clouds significantly hinder optical remote sensing analysis, especially across heterogeneous multispectral sensors. MSCloudCAM introduces a convolution-based cross-attention framework that fuses dual multi-scale context encoders—one prioritizing fine spatial details and the other global semantics—via a learned cross-attention mechanism. The approach, integrated into a Swin Transformer backbone with a deep-supervision decoder, achieves state-of-the-art performance on CloudSEN12 and L8Biome while maintaining competitive complexity. This method provides robust spectral–spatial discrimination across sensors, enabling more reliable cloud masks for environmental and climate analyses.
Abstract
Clouds remain a major obstacle in optical satellite imaging, limiting accurate environmental and climate analysis. To address the strong spectral variability and the large scale differences among cloud types, we propose MSCloudCAM, a novel multi-scale context adapter network with convolution-based cross-attention tailored for multispectral and multi-sensor cloud segmentation. A key contribution of MSCloudCAM is the explicit modeling of multiple complementary multi-scale context extractors. Rather than simply stacking or concatenating their outputs, our formulation pairs one extractor's fine-resolution features with the other extractor's global contextual representations, enabling dynamic, scale-aware feature selection. Building on this idea, we design a new convolution-based cross-attention adapter that effectively fuses localized, detailed information with broader multi-scale context. Integrated with a hierarchical vision backbone and refined through channel and spatial attention mechanisms, MSCloudCAM achieves strong spectral-spatial discrimination. Experiments on multi-sensor datasets, including CloudSEN12 (Sentinel-2) and L8Biome (Landsat-8), show that MSCloudCAM outperforms recent state-of-the-art models while maintaining competitive model complexity, highlighting the effectiveness of the proposed design for large-scale Earth observation.
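To make the fusion idea concrete, the following is a minimal NumPy sketch of a convolution-based cross-attention step. It is not the paper's implementation. It assumes that the fine-resolution features act as queries and the global-context features act as keys and values, that the projections are 1x1 convolutions, and that scaled dot-product softmax attention is used; the abstract does not specify any of these choices. All names (`conv1x1`, `conv_cross_attention`, `fine`, `context`, the weight matrices) are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv_cross_attention(fine, context, wq, wk, wv):
    """Fuse fine-detail features with global-context features.

    Assumption (not from the paper): `fine` supplies the queries and
    `context` supplies the keys and values.

    fine:    (C, H, W) feature map from a hypothetical detail-oriented extractor.
    context: (C, Hc, Wc) feature map from a hypothetical global-context extractor.
    wq, wk, wv: (D, C) weights of the 1x1 convolution projections.
    Returns the fused (D, H, W) map and the (H*W, Hc*Wc) attention weights.
    """
    d = wq.shape[0]
    _, h, w = fine.shape
    q = conv1x1(fine, wq).reshape(d, -1).T      # (N_fine, D)
    k = conv1x1(context, wk).reshape(d, -1)     # (D, N_ctx)
    v = conv1x1(context, wv).reshape(d, -1).T   # (N_ctx, D)
    # Each fine-resolution position attends over all context positions.
    attn = softmax(q @ k / np.sqrt(d), axis=-1)  # (N_fine, N_ctx)
    fused = (attn @ v).T.reshape(d, h, w)        # back to a spatial map
    return fused, attn

# Toy inputs: 8 channels, 6x6 spatial grid, 4-dimensional attention space.
rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 6, 6
fine = rng.standard_normal((C, H, W))
context = rng.standard_normal((C, H, W))
wq, wk, wv = (rng.standard_normal((D, C)) * 0.1 for _ in range(3))

fused, attn = conv_cross_attention(fine, context, wq, wk, wv)
print(fused.shape, attn.shape)
```

In this sketch each spatial position of the fine-resolution map receives a weighted mixture of context features, which is one plausible reading of "dynamic, scale-aware feature selection"; a trained model would learn the projection weights rather than sample them randomly.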
