Improving satellite imagery segmentation using multiple Sentinel-2 revisits

Kartik Jindgar, Grace W. Lindsay

TL;DR

This study addresses how best to exploit multiple revisits in Sentinel-2 imagery when fine-tuning pre-trained remote sensing models for semantic segmentation. It systematically compares five multi-temporal input strategies across three architectures (U-Net, ViT, SWIN) and finds that latent-space fusion of revisits, particularly with a SWIN Transformer backbone, yields the strongest performance gains. The approach generalizes beyond the target task, as shown by consistent improvements on the PhilEO building-density dataset. The results offer a practical method for using temporal information in climate-relevant segmentation tasks and inform future hybrid fusion strategies for satellite imagery analysis.

Abstract

In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits when fine-tuning pre-trained remote sensing models. We focus on an applied research question of relevance to climate change mitigation -- power substation segmentation -- that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including using them as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-Net and ViT-based models. We verify the generality of our results on a separate building density estimation task.
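
To make the distinction between fusing revisits in the latent space and fusing them at the input concrete, the following is a minimal NumPy sketch, not the authors' implementation. Everything in it is an assumption for illustration: the `toy_encoder` (a per-pixel linear map with a ReLU) stands in for a pre-trained backbone, the mean and max fusion operators are placeholders since the abstract does not say which operator is used, and the sizes (4 revisits, 13 spectral bands, 8x8 pixels, 16 latent channels) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_encoder(image, weights):
    """Hypothetical stand-in for a pre-trained backbone.

    Applies a per-pixel linear map (equivalent to a 1x1 convolution)
    followed by a ReLU. image: (C, H, W), weights: (D, C) -> (D, H, W).
    """
    return np.maximum(np.einsum("dc,chw->dhw", weights, image), 0.0)


def fuse_latents(revisits, weights, mode="mean"):
    """Latent-space fusion: encode every revisit with shared weights,
    then combine the resulting feature maps across the time axis.

    revisits: (T, C, H, W) -> fused features (D, H, W).
    The choice of mean/max is an assumption, not taken from the paper.
    """
    latents = np.stack([toy_encoder(img, weights) for img in revisits])
    if mode == "mean":
        return latents.mean(axis=0)
    if mode == "max":
        return latents.max(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")


def fuse_inputs(revisits, weights):
    """Input-space baseline: average the revisits pixel-wise first,
    then encode the single composite image."""
    return toy_encoder(revisits.mean(axis=0), weights)


# Illustrative sizes only (assumed, not from the paper).
T, C, H, W, D = 4, 13, 8, 8, 16
revisits = rng.random((T, C, H, W))
weights = rng.standard_normal((D, C))

latent_fused = fuse_latents(revisits, weights, mode="mean")
input_fused = fuse_inputs(revisits, weights)

print(latent_fused.shape, input_fused.shape)
```

Because the encoder is non-linear, the two routes give different features: averaging after encoding keeps per-revisit activations that averaging before encoding can cancel out. In a fine-tuning setup, the fused feature map would then be passed to a segmentation decoder. The data-augmentation alternative mentioned in the abstract would instead treat each of the `T` revisits as a separate training sample that shares one mask.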

Paper Structure (16 sections, 5 figures, 9 tables)

This paper contains 16 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Images from multiple revisits along with mask
  • Figure 2: Segmentation model with SWIN transformer backbone and multi-temporal input
  • Figure 3: U-Net with a ResNet50 backbone and multi-temporal input
  • Figure 4: Predictions by the SWIN model on the power-substation dataset
  • Figure 5: a) Architecture of the decoder used in the ViT-based model, comprising multiple up-scaling blocks followed by a convolution layer; b) Architecture of the up-scaling block used in the decoder