SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation Semantic Segmentation in Remote Sensing

Bin Wang, Fei Deng, Shuang Wang, Wen Luo, Zhixuan Zhang, Peifan Jiang

TL;DR

SiamSeg addresses cross-domain semantic segmentation in remote sensing by integrating contrastive learning with self-training, enabling stronger feature learning from unlabeled target-domain data. The method combines a standard segmentation network with an EMA teacher for pseudo-labeling and a Siamese contrastive branch that generates two augmented views of each target image and aligns their representations through a negative cosine similarity objective. Empirical results on Potsdam, Vaihingen, and LoveDA demonstrate state-of-the-art performance across multiple cross-domain tasks, with ablations confirming the value of contrastive supervision and informative augmentations. The approach offers practical benefits for RS applications with limited labeled data, and code is publicly available for reproducibility; future work aims to reduce reliance on source data via source-free domain adaptation (SFDA) and to enhance generalization across tasks.

Abstract

Semantic segmentation of remote sensing (RS) images is a challenging yet essential task with broad applications. While deep learning, particularly supervised learning with large-scale labeled datasets, has significantly advanced this field, the acquisition of high-quality labeled data remains costly and time-intensive. Unsupervised domain adaptation (UDA) provides a promising alternative by enabling models to learn from unlabeled target domain data while leveraging labeled source domain data. Recent self-training (ST) approaches employing pseudo-label generation have shown potential in mitigating domain discrepancies. However, the application of ST to RS image segmentation remains underexplored. Variations in ground sampling distance, imaging equipment, and geographic diversity exacerbate domain shifts and limit model performance across domains; as a result, existing ST methods often underperform on cross-domain RS images. To address these challenges, we propose integrating contrastive learning into UDA, enhancing the model's ability to capture semantic information in the target domain by maximizing the similarity between augmented views of the same image. This additional supervision improves the model's representational capacity and segmentation performance in the target domain. Extensive experiments conducted on RS datasets, including Potsdam, Vaihingen, and LoveDA, demonstrate that our method, SiamSeg, outperforms existing approaches, achieving state-of-the-art results. Visualization and quantitative analyses further validate SiamSeg's superior ability to learn from the target domain. The code is publicly available at https://github.com/woldier/SiamSeg.
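
As a rough illustration of the contrastive objective described above, the sketch below computes a negative cosine similarity between the embeddings of two augmented views of the same image. The function names and the vector representation are hypothetical and are not taken from the SiamSeg repository. The symmetric averaging over both views is an assumption, and the stop-gradient that Siamese methods of this kind commonly apply is not modeled in this pure-Python version.

```python
import math


def negative_cosine_similarity(p, z):
    """Return -cos(p, z) for two equal-length vectors.

    The value is -1.0 when the vectors point in the same direction,
    so minimizing it pulls the two representations together.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    if norm_p == 0.0 or norm_z == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return -dot / (norm_p * norm_z)


def symmetric_contrastive_loss(p1, z1, p2, z2):
    """Average the loss over both view orderings (assumed symmetric form).

    p1, p2: hypothetical predictor outputs for view 1 and view 2.
    z1, z2: hypothetical encoder outputs for view 1 and view 2.
    """
    return 0.5 * negative_cosine_similarity(p1, z2) + 0.5 * negative_cosine_similarity(p2, z1)
```

In this sketch, two views whose embeddings already agree in direction give a loss of -1, the minimum, regardless of their magnitudes.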

Paper Structure

This paper contains 33 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The main challenges in cross-domain semantic segmentation of remote sensing images. These challenges include domain shift caused by differences in ground sampling distance, sensors, and geographic landscapes, which affects the model's ability to generalize across datasets. Understanding these domain shift issues is crucial for improving the accuracy and robustness of semantic segmentation of RS images.
  • Figure 2: Overview of SiamSeg. The framework consists of a segmentation network $g_\theta$, which comprises a feature extraction backbone $f$ and a decoding head, an EMA teacher network $t_\theta$, and a contrastive network.
  • Figure 3: Detail of Contrastive Network. This figure illustrates the architecture of the Siamese network used for contrastive learning. The network consists of two identical sub-networks that share the same model weights, ensuring consistency in feature extraction.
  • Figure 4: Visualization of results on Potsdam and Vaihingen datasets. The cross-domain tasks from top to bottom are Potsdam IRRG to Vaihingen IRRG and Potsdam RGB to Vaihingen IRRG. The categories represented by the different colors are listed at the bottom of the picture with their names and colors.
  • Figure 5: Visualization of results on the LoveDA dataset for the single cross-domain task, Rural to Urban. Because images in the test set do not have annotations, results are shown for images from the validation set.
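
The EMA teacher network named in Figure 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are represented as flat lists of floats, the helper names `ema_update` and `pseudo_label` are hypothetical, and the momentum value `alpha=0.99` is a placeholder rather than a setting reported in the paper.

```python
def ema_update(teacher, student, alpha=0.99):
    """Blend student weights into the teacher: t <- alpha * t + (1 - alpha) * s.

    teacher, student: equal-length lists of floats standing in for model weights.
    alpha: momentum (assumed value; a larger alpha makes the teacher change more slowly).
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def pseudo_label(class_probs):
    """Return (class index, confidence) for one pixel's teacher class probabilities."""
    confidence = max(class_probs)
    return class_probs.index(confidence), confidence
```

In a self-training loop of this kind, the teacher's predictions on unlabeled target images would supply the pseudo-labels, and the teacher would be refreshed with `ema_update` after each student step.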