Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal
Kaichen Chi, Wei Jing, Junjie Li, Qiang Li, Qi Wang
TL;DR
This work tackles shadow removal in remote sensing by exploiting cross-modal information from visible and infrared images under weak supervision. It introduces S2-ShadowNet, which translates visible data to infrared, extracts rich features with Swin Transformers, and maps them into a spherical space to decompose representations into shared (aligned) and private (separated) components, guided by orthogonality and similarity losses along with adversarial and identity objectives. The approach leverages modal translation and a spherical aggregation framework to reduce domain shift and recover shadow-free imagery without requiring ground-truth pairs, and it introduces the large-scale WSSR benchmark. Experimental results on UAV-SC, WSSR, and URSSR show state-of-the-art performance on full-reference metrics and strong gains on no-reference metrics, validating the effectiveness of cross-modal spherical aggregation for shadow removal in diverse remote-sensing scenarios. The work advances practical shadow removal by enabling robust, detail-preserving restoration under weak supervision, with significant implications for downstream tasks in Earth observation.
Abstract
Remote sensing shadow removal, which aims to recover contaminated surface information, is challenging since shadows typically exhibit overwhelmingly low illumination intensities. In contrast, the infrared image is robust to significant light changes, providing visual clues complementary to the visible image. Nevertheless, existing methods ignore the collaboration between heterogeneous modalities, leading to undesired quality degradation. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for the visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, a Swin Transformer is utilized to extract strongly representational visible/infrared features. Simultaneously, the extracted features are mapped to a smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity and orthogonality losses are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. This encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, since ground truth is not available in practice, S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding rigid and costly paired data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark, comprising 4,000 shadow images with corresponding shadow masks.
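The spherical constraints described above can be illustrated with a minimal sketch. The snippet below is a hypothetical NumPy rendering, not the authors' implementation: features are projected onto the unit hypersphere by L2 normalization, a cosine-based similarity loss pulls the shared visible/infrared features into alignment, and an orthogonality loss pushes shared and private features apart; the function names and exact loss forms are assumptions for illustration.

```python
import numpy as np

def to_sphere(x, eps=1e-8):
    """Project feature vectors onto the unit hypersphere (L2 normalization).

    Hypothetical stand-in for the paper's spherical manifold mapping."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def similarity_loss(shared_vis, shared_ir):
    """Align shared visible/infrared features: mean (1 - cosine similarity).

    On the unit sphere, cosine similarity is just the dot product."""
    v = to_sphere(shared_vis)
    r = to_sphere(shared_ir)
    return float(np.mean(1.0 - np.sum(v * r, axis=-1)))

def orthogonality_loss(shared, private):
    """Separate shared and private features of one modality:
    penalize the squared cosine between them (zero when orthogonal)."""
    s = to_sphere(shared)
    p = to_sphere(private)
    return float(np.mean(np.sum(s * p, axis=-1) ** 2))

# Toy batch of 2-D feature vectors (assumed shapes: [batch, dim]).
vis_shared = np.array([[1.0, 0.0], [0.0, 2.0]])
ir_shared = np.array([[2.0, 0.0], [0.0, 1.0]])   # same directions, different norms
vis_private = np.array([[0.0, 1.0], [3.0, 0.0]])  # orthogonal to vis_shared

print(similarity_loss(vis_shared, ir_shared))     # aligned directions -> ~0
print(orthogonality_loss(vis_shared, vis_private))  # orthogonal -> ~0
```

Normalizing before comparison means both losses depend only on feature orientation, which matches the abstract's point that the spherical space regularizes away norm-driven domain shift between modalities.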
