GeoFormer: A Multi-Polygon Segmentation Transformer

Maxim Khomiakov, Michael Riis Andersen, Jes Frellsen

TL;DR

This study presents the first successful application of auto-regressive transformer models for multi-polygon predictions in remote sensing, suggesting a promising methodological alternative for building vectorization.

Abstract

In remote sensing there is a common need to learn scale-invariant shapes of objects such as buildings. Prior work relies on tuning multiple loss functions to convert segmentation maps into the final scale-invariant representation, which requires arduous design and optimization. To address this, we introduce the GeoFormer, a novel architecture that remedies these challenges by learning to generate multipolygons end-to-end. By modeling keypoints as spatially dependent tokens in an auto-regressive manner, the GeoFormer outperforms existing works in delineating building objects from satellite imagery. We evaluate the robustness of the GeoFormer against earlier methods through a variety of parameter ablations and highlight the advantages of optimizing a single likelihood function. Our study presents the first successful application of auto-regressive transformer models for multi-polygon predictions in remote sensing, suggesting a promising methodological alternative for building vectorization.
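To make the idea of modeling keypoints as tokens concrete, the following is a minimal sketch of how a multipolygon could be flattened into a single token sequence that a decoder predicts one token at a time. It is an illustration under stated assumptions, not the authors' implementation: the special-token names (`PAD`, `SEP`, `EOS`), their ids, the x-then-y ordering, and the parameters `image_size`, `num_bins` and `max_len` are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

# Hypothetical special tokens (assumed names and ids, not taken from the paper).
PAD, SEP, EOS = 0, 1, 2
NUM_SPECIAL = 3


def quantize(value: float, image_size: int, num_bins: int) -> int:
    """Map a pixel coordinate in [0, image_size) to an integer bin index."""
    bin_index = int(value * num_bins / image_size)
    return min(max(bin_index, 0), num_bins - 1)


def encode_multipolygon(
    polygons: List[Polygon],
    image_size: int = 224,
    num_bins: int = 224,
    max_len: int = 16,
) -> List[int]:
    """Flatten polygons into one sequence: x, y, x, y, ..., SEP, ..., EOS, PAD."""
    tokens: List[int] = []
    for index, polygon in enumerate(polygons):
        if index > 0:
            tokens.append(SEP)  # marks the boundary between two polygons
        for x, y in polygon:
            # Coordinate tokens are offset so they never collide with special ids.
            tokens.append(NUM_SPECIAL + quantize(x, image_size, num_bins))
            tokens.append(NUM_SPECIAL + quantize(y, image_size, num_bins))
    tokens.append(EOS)
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    return tokens + [PAD] * (max_len - len(tokens))


def decode_tokens(
    tokens: List[int], image_size: int = 224, num_bins: int = 224
) -> List[Polygon]:
    """Invert encode_multipolygon, up to quantization error."""
    scale = image_size / num_bins
    polygons: List[Polygon] = []
    current: Polygon = []
    pending_x = None
    for token in tokens:
        if token == PAD:
            continue
        if token in (SEP, EOS):
            if current:
                polygons.append(current)
            current, pending_x = [], None
            if token == EOS:
                break
            continue
        value = (token - NUM_SPECIAL) * scale
        if pending_x is None:
            pending_x = value  # first token of a pair is the x coordinate
        else:
            current.append((pending_x, value))
            pending_x = None
    return polygons


buildings = [
    [(10, 20), (30, 20), (30, 40)],
    [(100, 100), (120, 100), (120, 130)],
]
sequence = encode_multipolygon(buildings)
recovered = decode_tokens(sequence)
```

With a sequence of this kind, a single next-token likelihood covers both the vertex coordinates and the polygon structure, which is the property the abstract contrasts with pipelines that combine several loss functions.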

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the GeoFormer model architecture. On the left-hand side, an image is patch-embedded and passed through all four SWINv2 (Liu et al., 2022) layers; each layer skips forward its feature representation, which is then convolved and upsampled to match the hidden dimensions of the decoder. On the right-hand side is our auto-regressive decoder, which takes as input a flattened sequence of spatial tokens together with 3 special tokens.
  • Figure 2: Qualitative results of model predictions on test set images together with ground truth. Columns from left to right: ground truth, FFL (Girard et al., 2021), PolyWorld (Zorzi et al., 2022), HiSup (Xu et al., 2023) and GeoFormer (ours).
  • Figure 3: Visual examples of the perturbations applied to input images in the robustness studies. From the top row: downsampling, erased dropout, and rotations.
  • Figure 4: Performance relative to perturbations applied to the Aicrowd small dataset: downsampling, rotations and random dropout. For downsampling, a perturbation factor of 2 equates to a 2x lower spatial resolution. For dropout, each perturbation factor corresponds to $3\% \times \text{perturbation factor}$ of the pixels being erased. For rotations, the perturbation factor is the angle by which the input is rotated.
  • Figure 5: Visualisation of the attention maps on top of the input image and predicted polygons for pairs of tokens $s_{t:t+1}$ from the final layer of the GeoFormer decoder.
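
The perturbation factors described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, the nearest-neighbour downsampling, the per-pixel (rather than block-wise) erasing, the fill value of 0 and the fixed seed are assumptions made for illustration. Rotation is omitted because the caption does not state how rotated borders are handled.

```python
import random
from typing import List

Image = List[List[int]]  # a single-channel image as nested lists


def downsample(image: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel along both axes (assumed nearest-neighbour)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [row[::factor] for row in image[::factor]]


def erase_pixels(
    image: Image,
    factor: int,
    seed: int = 0,
    fraction_per_step: float = 0.03,
) -> Image:
    """Erase 3% x factor of the pixels, chosen uniformly at random."""
    height, width = len(image), len(image[0])
    num_erased = round(fraction_per_step * factor * height * width)
    rng = random.Random(seed)
    erased = rng.sample(range(height * width), num_erased)
    output = [row[:] for row in image]  # copy so the input stays untouched
    for flat_index in erased:
        output[flat_index // width][flat_index % width] = 0
    return output


image = [[1] * 100 for _ in range(100)]
half_resolution = downsample(image, factor=2)   # 50 x 50 pixels
six_percent_erased = erase_pixels(image, factor=2)
num_zeros = sum(row.count(0) for row in six_percent_erased)
```

For a 100 x 100 image, a dropout perturbation factor of 2 erases 6% of the 10,000 pixels, that is 600 pixels, while a downsampling factor of 2 halves the resolution along each axis under the assumption made here.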