VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation

Ezra MacDonald, Derek Jacoby, Yvonne Coady

TL;DR

This work introduces VistaFormer, a lightweight Transformer-based architecture for the semantic segmentation of remote-sensing images. It uses position-free self-attention layers, which simplify the architecture and remove the need to interpolate temporal and spatial codes.

Abstract

We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. The model pairs a multi-scale Transformer-based encoder with a lightweight decoder that aggregates the global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers, which simplify the architecture and remove the need to interpolate temporal and spatial codes, a step that can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals such as clouds, and demonstrate that model scalability improves when Multi-Head Self-Attention (MHSA) is replaced with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer outperforms comparable models while requiring only 8% of their floating point operations with MHSA and 11% with NA, and while using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark, while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.
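
The mIoU scores quoted above are mean intersection-over-union values. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the flat label lists are illustrative assumptions rather than the authors' evaluation code, and the benchmarks' handling of void or ignored labels is not modelled.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average the per-class IoU over classes seen in the labels or predictions.

    Illustrative helper only: `y_true` and `y_pred` are flat sequences of
    integer class ids, one entry per pixel.
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            intersection[truth] += 1
            union[truth] += 1
        else:
            union[truth] += 1  # a false negative for the true class
            union[pred] += 1   # a false positive for the predicted class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    preds = [0, 1, 1, 1, 2, 0]
    print(mean_iou(truth, preds, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.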

Paper Structure (20 sections, 4 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: The VistaFormer model uses a three-layer encoder-decoder architecture. The encoder blocks downsample inputs and compute self-attention, while the decoder blocks consist of lightweight upsampling layers that unify features from the encoder outputs to generate dense predictions.
  • Figure 2: (a) Each encoder block downsamples inputs using gated convolutions to reduce atmospheric distortions, reshapes them into sequences of tokens, and processes them through self-attention Transformer layers. (b) The gated convolutions used here improve the model's resilience to obstructions such as clouds in the input samples. (c) The decoder block uses trilinear upsampling and a 1D convolution to extract features and align embedding dimensions, producing a dense prediction.
  • Figure 3: Sample VistaFormer semantic segmentation predictions on the PASTIS benchmark. Under the titles $T=0, \ldots, 3$, we show the RGB channels of input samples alongside ground truth annotations, model predictions, attention maps, and Monte Carlo dropout [gal_dropout_2016] predictions, which measure the uncertainty of the model predictions. Monte Carlo dropout uses the same dropout settings as training, and the outputs reflect model certainty measured over 10 iterations.
  • Figure 4: (a) Scaling of floating point operations, in GFLOPs, for VistaFormer with MHSA and with NA compared with the TSViT and U-TAE models, using input dimensions $(B, C, T, H, W) = (4, 10, 30, x_i, x_i)$ where we scale the height and width dimensions using $x_i$. (b) Scaling of VistaFormer in GFLOPs using input dimensions $(B, C, T, H, W) = (4, 10, t_i, 64, 64)$ where we scale the temporal dimension using $t_i$.
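
The scaling behaviour compared in Figure 4 can be illustrated with a back-of-envelope cost model for the attention products alone. The sketch below is not the FLOP counter behind the figure. It relies on the standard observation that full self-attention over $N$ tokens costs on the order of $N^2 d$ multiply-accumulates, while attention restricted to a fixed neighbourhood of $k$ tokens costs on the order of $N k d$. The downsampling stride, embedding width, neighbourhood size, and all function names are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def token_count(frames, height, width, stride=4):
    """Tokens left after a hypothetical spatial downsampling by `stride`."""
    return frames * (height // stride) * (width // stride)


def full_attention_macs(num_tokens, dim):
    """Multiply-accumulates for QK^T plus the attention-weighted sum of V."""
    return 2 * num_tokens * num_tokens * dim


def neighbourhood_attention_macs(num_tokens, dim, neighbourhood):
    """Same two products when each token attends to at most `neighbourhood` tokens."""
    return 2 * num_tokens * min(neighbourhood, num_tokens) * dim


if __name__ == "__main__":
    dim = 32             # hypothetical embedding width
    neighbourhood = 49   # hypothetical 7 x 7 window
    for side in (32, 64, 128):
        n = token_count(30, side, side)
        print(side, n,
              full_attention_macs(n, dim),
              neighbourhood_attention_macs(n, dim, neighbourhood))
```

Under these assumptions, doubling the image side quadruples the token count, which multiplies the full-attention cost by 16 but the neighbourhood-attention cost by only 4. The sketch ignores projections, feed-forward layers and convolutions, so absolute numbers will not match the reported GFLOPs.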