Enhancing Crop Segmentation in Satellite Image Time Series with Transformer Networks

Ignazio Gallo, Mattia Gatti, Nicola Landro, Christian Loschiavo, Mirco Boschetti, Riccardo La Grassa

TL;DR

The paper investigates whether transformer-based encoders can outperform CNNs for crop segmentation in Satellite Image Time Series (SITS) and adapts the Swin UNETR model to this domain. By reconfiguring Swin UNETR to process temporal multispectral Sentinel-2 data, the authors report state-of-the-art accuracy on the Munich dataset (96.14% validation and 95.26% test, against a previous best of 93.55% and 92.94%) and results on the Lombardia dataset that are comparable to UNet3D and better than FPN and DeepLabV3. The experiments also suggest shorter training times than the CNN baselines. These findings indicate that transformer architectures can capture long-range spatio-temporal dependencies in SITS and may offer efficiency benefits for remote sensing crop mapping. The work opens avenues for applying transformers to broader geospatial tasks beyond crop segmentation.

Abstract

Recent studies have shown that Convolutional Neural Networks (CNNs) achieve impressive results in crop segmentation of Satellite Image Time Series (SITS). However, the emergence of transformer networks in various vision tasks raises the question of whether they can outperform CNNs in this task as well. This paper presents a revised version of the Transformer-based Swin UNETR model, specifically adapted for crop segmentation of SITS. The proposed model achieves a validation accuracy of 96.14% and a test accuracy of 95.26% on the Munich dataset, surpassing the previous best results of 93.55% for validation and 92.94% for test. Additionally, the model's performance on the Lombardia dataset is comparable to UNet3D and superior to FPN and DeepLabV3. The experiments in this study indicate that the model is likely to achieve comparable or superior accuracy to CNNs while requiring significantly less training time. These findings highlight the potential of transformer-based architectures for crop segmentation in SITS, opening new avenues for remote sensing applications.
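
The accuracy figures quoted above are pixel-level scores. As a minimal sketch of how such a score is commonly computed for a segmentation map (not the authors' evaluation code), the snippet below counts the fraction of pixels whose predicted class matches the label. The function name `overall_accuracy` and the optional `ignore_label` parameter are illustrative assumptions; whether the paper excludes unlabeled pixels is not stated in this summary.

```python
from typing import Optional, Sequence


def overall_accuracy(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    ignore_label: Optional[int] = None,
) -> float:
    """Fraction of pixels whose predicted class equals the label.

    `pred` and `target` are 2D class-index maps of the same shape.
    `ignore_label` is a hypothetical convenience: pixels carrying that
    label in `target` are skipped (e.g. an "unknown" class).
    """
    correct = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if ignore_label is not None and t == ignore_label:
                continue  # skip pixels that should not be scored
            total += 1
            correct += int(p == t)
    if total == 0:
        raise ValueError("no labeled pixels to score")
    return correct / total


# Toy 2x2 example: three of the four pixels are classified correctly.
pred_map = [[1, 1], [2, 0]]
label_map = [[1, 2], [2, 0]]
score = overall_accuracy(pred_map, label_map)
```

For a whole validation or test split, the correct and total counts would be accumulated over all label maps before dividing, rather than averaging per-image scores.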

Paper Structure

This paper contains 5 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Proposed adaptation of Swin UNETR for use with Sentinel-2 time series.
  • Figure 2: Three random samples of input-output pairs from the Munich dataset. On the top, the input is shown as an RGB image. On the bottom, the output shows the class labels as colors.
  • Figure 3: Three random samples of input-output pairs from the Lombardia dataset. On the top, the input is shown as an RGB image (only one image out of 32 is shown for simplicity). On the bottom, the output shows the class labels as colors.
  • Figure 4: An example of a good prediction made by the Swin UNETR model on the Munich dataset.
  • Figure 5: An example of a bad prediction made by the Swin UNETR model on the Munich dataset.
  • ...and 3 more figures