CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation
Chenying Liu, Conrad Albrecht, Yi Wang, Xiao Xiang Zhu
TL;DR
The paper addresses the scarcity of high-quality labels for remote sensing segmentation by introducing CromSS, a cross-modal pretraining framework that leverages large-scale noisy labels together with optical and SAR imagery. CromSS pairs two modality-specific models through inter-modal consistency losses, explores middle and late fusion, and combines a cross-modal sample-selection mechanism with spatial-temporal label smoothing to mitigate label noise while retaining fine-grained details. The authors assemble NoLDO-S12, a multi-modal dataset comprising a pretraining subset (SSL4EO-S12@NoL, with DW-derived noisy labels) and two downstream subsets with high-quality labels (DW and OSM), and additionally use DFC2020 for transfer evaluation. Across the three downstream datasets, CromSS achieves competitive or superior performance, particularly for multi-spectral S2 inputs. Extensive ablations and analysis offer insights into encoder/decoder behavior and the trade-offs of sample selection, which can guide future noisy-label pretraining for geospatial segmentation.
Abstract
We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multi-modal framework for geospatial applications. We propose Cross-modal Sample Selection (CromSS), a novel weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation techniques. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy, easily obtained labels to learn features that benefit semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multi-modal pretraining architecture design. We also introduce a cross-modal sample selection module to mitigate the adverse effects of label noise. The module employs a cross-modal entangling strategy to refine the confidence masks estimated within each modality, which then guide the sampling process. Additionally, we introduce a spatial-temporal label smoothing technique to counteract overconfidence and improve robustness against noisy labels. To validate our approach, we assemble a multi-modal dataset, NoLDO-S12, which consists of a large-scale noisy label subset from Google's Dynamic World (DW) dataset for pretraining and two downstream subsets with high-quality labels from Google DW and OpenStreetMap (OSM) for transfer learning. Experimental results on the two downstream tasks and the publicly available DFC2020 dataset demonstrate that, when used effectively, low-cost noisy labels can significantly enhance feature learning for segmentation tasks. All data, code, and pretrained weights will be made publicly available.
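To make the idea of confidence-guided sample selection concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each modality's model yields, for every pixel, the probability it assigns to the noisy label; that the two confidence maps are "entangled" by simple averaging; and that a fixed fraction of the most confident pixels is kept for the loss. The function names, the averaging rule, and the `keep_ratio` parameter are all hypothetical choices made for illustration.

```python
import math


def select_pixels(conf_s1, conf_s2, keep_ratio=0.5):
    """Return a 0/1 mask keeping the pixels with the highest combined confidence.

    conf_s1, conf_s2: flattened per-pixel confidences, i.e. the probability
    each modality's model assigns to the (noisy) label of that pixel.
    Averaging the two maps is an assumed stand-in for cross-modal entangling.
    """
    if len(conf_s1) != len(conf_s2):
        raise ValueError("confidence maps must have the same number of pixels")
    combined = [(a + b) / 2.0 for a, b in zip(conf_s1, conf_s2)]
    n_keep = max(1, int(keep_ratio * len(combined)))
    # Rank pixel indices from most to least confident and keep the top ones.
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(combined))]


def masked_cross_entropy(probs_for_label, mask):
    """Mean negative log-likelihood of the noisy label over selected pixels only."""
    selected = [p for p, m in zip(probs_for_label, mask) if m == 1]
    return -sum(math.log(p) for p in selected) / len(selected)


# Toy example with four pixels (values are made up for illustration).
conf_s1 = [0.7, 0.3, 0.1, 0.6]  # hypothetical SAR-model confidences
conf_s2 = [0.9, 0.2, 0.8, 0.4]  # hypothetical optical-model confidences

mask = select_pixels(conf_s1, conf_s2, keep_ratio=0.5)
loss_s2 = masked_cross_entropy(conf_s2, mask)
```

In this toy case the third pixel is dropped even though the optical model is confident about it, because the SAR model strongly disagrees. This is the kind of behavior a cross-modal selection rule is meant to produce, although the actual weighting scheme in CromSS may differ.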
