Improving satellite imagery segmentation using multiple Sentinel-2 revisits

Kartik Jindgar; Grace W. Lindsay

TL;DR

This study addresses how to best exploit multiple revisits in Sentinel-2 imagery when fine-tuning pre-trained remote sensing models for semantic segmentation. It systematically compares five multi-temporal input strategies across three architectures (U-Net, ViT, SWIN) and finds that latent-space fusion of revisits, particularly with a SWIN transformer backbone, yields the strongest performance gains. The approach generalizes beyond the target task, as demonstrated by consistent improvements on the PhilEO building-density dataset, underscoring its practical value for climate-relevant remote sensing applications. The results offer a robust, scalable method for leveraging temporal information in land-cover tasks and inform future hybrid fusion strategies for satellite imagery analysis.

Abstract

In recent years, analysis of remote sensing data has benefited immensely from borrowing techniques from the broader field of computer vision, such as the use of shared models pre-trained on large and diverse datasets. However, satellite imagery has unique features that are not accounted for in traditional computer vision, such as the existence of multiple revisits of the same location. Here, we explore the best way to use revisits in the framework of fine-tuning pre-trained remote sensing models. We focus on an applied research question relevant to climate change mitigation (power substation segmentation) that is representative of applied uses of pre-trained models more generally. Through extensive tests of different multi-temporal input schemes across diverse model architectures, we find that fusing representations from multiple revisits in the model latent space is superior to other methods of using revisits, including using them as a form of data augmentation. We also find that a SWIN Transformer-based architecture performs better than U-Net and ViT-based models. We verify the generality of our results on a separate building density estimation task.
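
To make the idea of latent-space fusion concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a pre-trained backbone, and pooling with a mean or a max across revisits is an assumed fusion operator, since the abstract does not say which operator the paper uses. The point is only the order of operations: each revisit is encoded separately with the same encoder, and the resulting latent vectors are combined afterwards.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Vector = List[float]


def toy_encoder(pixels: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a pre-trained backbone.

    Maps one revisit (a flat list of pixel values) to a 2-dimensional
    latent vector: mean brightness and brightness range.
    """
    return [fmean(pixels), max(pixels) - min(pixels)]


def fuse_latents(
    revisits: Sequence[Sequence[float]],
    encoder: Callable[[Sequence[float]], Vector],
    reduce: str = "mean",
) -> Vector:
    """Encode every revisit with the same encoder, then pool across time.

    The pooling operator (mean or max) is an assumption made for this sketch.
    """
    if not revisits:
        raise ValueError("at least one revisit is required")
    if reduce not in ("mean", "max"):
        raise ValueError("reduce must be 'mean' or 'max'")

    # One latent vector per revisit; the encoder weights are shared.
    latents = [encoder(revisit) for revisit in revisits]

    pool = fmean if reduce == "mean" else max
    # zip(*latents) groups the i-th latent feature of every revisit together,
    # so pooling happens per feature across the temporal axis.
    return [pool(feature_over_time) for feature_over_time in zip(*latents)]


# Three toy "revisits" of the same location, each with three pixels.
example_revisits = [
    [0.2, 0.4, 0.6],
    [0.0, 0.5, 1.0],
    [0.1, 0.1, 0.1],
]
fused = fuse_latents(example_revisits, toy_encoder)
print(fused)
```

In a real model the fused latent would be passed to a segmentation decoder. By contrast, using revisits as data augmentation would treat each revisit as an independent training sample and never combine them within a single forward pass.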

Paper Structure (16 sections, 5 figures, 9 tables)

Introduction
Related Work
Methodology
Dataset
Multi-Temporal Input
Model Architectures
Results
SWIN outperforms ViT and U-Net
Incorporating multiple revisits improves performance
PhilEO-Downstream Dataset
Conclusion
Data Pre-processing Techniques
Multi-spectral Input
Incorporating more channels boosts performance
Impact of the size of pre-trained ViT encoder on model performance
...and 1 more sections

Figures (5)

Figure 1: Images from multiple revisits along with mask
Figure 2: Segmentation model with SWIN transformer backbone and multi-temporal input
Figure 3: U-Net with a ResNet50 backbone and multi-temporal input
Figure 4: Predictions by the SWIN model on the power-substation dataset
Figure 5: a) Architecture of the decoder used in the ViT-based model, comprising multiple up-scaling blocks followed by a convolution layer; b) Architecture of the up-scaling block used in the decoder
