CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation

Chenying Liu, Conrad Albrecht, Yi Wang, Xiao Xiang Zhu

TL;DR

The paper addresses the scarcity of high-quality labels for remote sensing segmentation with CromSS, a cross-modal pretraining framework that exploits large-scale noisy labels for optical and SAR imagery. CromSS pairs two modality-specific models through inter-modal consistency losses, explores middle and late fusion, and combines a cross-modal sample-selection mechanism with spatial-temporal label smoothing to mitigate noise while retaining fine-grained details. The authors assemble NoLDO-S12, a multi-modal dataset comprising SSL4EO-S12@NoL with DW-derived noisy labels for pretraining and two downstream tasks with high-quality labels (DW and OSM); DFC2020 serves as an additional transfer benchmark. Across the three downstream datasets, CromSS achieves competitive or superior performance, particularly for multi-spectral S2 inputs. Ablations and analysis shed light on encoder/decoder behavior and the trade-offs of sample selection, offering guidance for future noisy-label pretraining in geospatial segmentation.

Abstract

We explore the potential of large-scale noisily labeled data to enhance feature learning by pretraining semantic segmentation models within a multi-modal framework for geospatial applications. We propose a novel Cross-modal Sample Selection (CromSS) method, a weakly supervised pretraining strategy designed to improve feature representations through cross-modal consistency and noise mitigation techniques. Unlike conventional pretraining approaches, CromSS exploits massive amounts of noisy and easy-to-come-by labels for improved feature learning beneficial to semantic segmentation tasks. We investigate middle and late fusion strategies to optimize the multi-modal pretraining architecture design. We also introduce a cross-modal sample selection module to mitigate the adverse effects of label noise, which employs a cross-modal entangling strategy to refine the estimated confidence masks within each modality to guide the sampling process. Additionally, we introduce a spatial-temporal label smoothing technique to counteract overconfidence for enhanced robustness against noisy labels. To validate our approach, we assembled the multi-modal dataset, NoLDO-S12, which consists of a large-scale noisy label subset from Google's Dynamic World (DW) dataset for pretraining and two downstream subsets with high-quality labels from Google DW and OpenStreetMap (OSM) for transfer learning. Experimental results on two downstream tasks and the publicly available DFC2020 dataset demonstrate that when effectively utilized, the low-cost noisy labels can significantly enhance feature learning for segmentation tasks. All data, code, and pretrained weights will be made publicly available.

Paper Structure (15 sections, 15 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Illustration of the pretraining set SSL4EO-S12@NoL in NoLDO-S12. From left to right: global distribution of samples (left), 4-season samples (top-down) at 3 geolocations (middle), and statistics of the classes of the noisy labels (right).
  • Figure 2: Illustration of the two downstream tasks in NoLDO-S12 with different label sources (SSL4EO-S12@DW and SSL4EO-S12@OSM). Top (left and right): global data distributions (DW and OSM). Middle (left and right): class distributions of training and test sets along with corresponding legends (DW and OSM). Bottom: examples from 2 locations. The legend for DW labels is the same as that in Figure 1.
  • Figure 3: Overview of the proposed cross-modal sample selection (CromSS) method.
  • Figure 4: Comparison of single-modal training and multi-modal (middle/late) fusion strategies.
  • Figure 5: Sample selection mask generation in three steps: confidence mask generation, cross-modal confidence enhancement, and thresholding, where $d$ and $d'$ represent two different modalities, $\alpha$ and $\gamma$ are the thresholds used to derive the sample selection masks for the segmentation and consistency losses, respectively.
  • ...and 8 more figures
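
The three steps named in the Figure 5 caption (confidence mask generation, cross-modal confidence enhancement, and thresholding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It rests on several assumptions not stated in this summary: confidence is taken as the softmax probability a model assigns to the given noisy label, enhancement is a plain average of the two modalities' confidence maps, and $\alpha$ and $\gamma$ are applied as fixed value thresholds. The function names `confidence_mask`, `enhance`, and `selection_masks` are hypothetical.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def confidence_mask(logits, noisy_labels):
    """Step 1 (assumed form): per-pixel probability of the given noisy label.

    logits: (C, H, W) float array, noisy_labels: (H, W) integer array.
    """
    probs = softmax(logits, axis=0)
    return np.take_along_axis(probs, noisy_labels[None], axis=0)[0]


def enhance(conf_d, conf_d_prime):
    """Step 2 (assumed form): average the confidences of modalities d and d'."""
    return 0.5 * (conf_d + conf_d_prime)


def selection_masks(conf, alpha, gamma):
    """Step 3: threshold into masks for the segmentation and consistency losses."""
    return conf >= alpha, conf >= gamma


# Toy example: 2 classes, a 1x2 image, both pixels labeled as class 0.
labels = np.array([[0, 0]])
logits_d = np.array([[[2.0, 0.0]], [[0.0, 2.0]]])  # modality d
logits_dp = np.zeros((2, 1, 2))                    # modality d', uninformative

conf_d = confidence_mask(logits_d, labels)
conf_dp = confidence_mask(logits_dp, labels)
seg_mask, cons_mask = selection_masks(enhance(conf_d, conf_dp), alpha=0.5, gamma=0.3)
print(seg_mask, cons_mask)
```

In this toy case the second pixel has low enhanced confidence, so it is excluded from the segmentation-loss mask but still passes the looser consistency threshold. The actual enhancement rule and threshold semantics should be taken from the paper itself.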