SARFormer -- An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data

Jonathan Prexl, Michael Recla, Michael Schmitt

TL;DR

SARFormer addresses the challenge of SAR imagery with complex acquisition geometry by introducing an Acquisition Parameter Encoding (APE) that conditions a Vision Transformer on per-view parameters. It couples a multi-view fusion strategy with MAE-based self-supervised pre-training, using three masking schemes to learn robust representations for downstream tasks such as height reconstruction (map and image geometry) and building-footprint segmentation. Across experiments on TerraSAR-X data, SARFormer with APE and domain-adapted pre-training achieves notable gains over CNN- and ViT-based baselines, including improvements under limited-label conditions, highlighting its potential as a SAR-focused foundation model. The work demonstrates the value of sensor-aware, geometry-conscious architectures for remote sensing and points to broad applicability across SAR missions and multi-task objectives.
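As a rough illustration of the MAE-style pre-training mentioned above, the sketch below shows plain random token masking, the generic scheme from masked autoencoders. It is not claimed to be any of the three masking schemes used in the paper; the function name `random_mask` and the 75% ratio are assumptions chosen for illustration.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Pick which patch tokens stay visible to the encoder (hypothetical helper).

    Returns the sorted indices of the kept tokens and a boolean mask
    where True marks a token that is hidden and must be reconstructed.
    """
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    perm = rng.permutation(num_tokens)      # random order of token indices
    keep = np.sort(perm[:num_keep])         # visible tokens, in original order
    mask = np.ones(num_tokens, dtype=bool)  # start with everything masked
    mask[keep] = False                      # unmask the kept tokens
    return keep, mask

rng = np.random.default_rng(0)
keep, mask = random_mask(num_tokens=16, mask_ratio=0.75, rng=rng)
```

In an MAE setup, only the tokens at `keep` are passed through the encoder, and the reconstruction loss is computed on the positions where `mask` is true.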

Abstract

This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed for processing one or multiple synthetic aperture radar (SAR) images. Given the complex image geometry of SAR data, we propose an acquisition parameter encoding module that significantly guides the learning process, especially in the case of multiple images, leading to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and benchmark our contribution and adaptations thoroughly in ablation experiments against a baseline, where the model is tested on tasks such as height reconstruction and segmentation. Our approach achieves up to 17% improvement in terms of RMSE over baseline models.
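To make the idea of an acquisition parameter encoding more concrete, here is a minimal sketch of one way per-view parameters could condition ViT tokens. It assumes a sinusoidal encoding of the viewing angle $\theta$ and the azimuth angle $Az$ that is added to every patch token of a view; the paper's actual module may be built differently, and the names `encode_angles` and `add_ape` are hypothetical.

```python
import numpy as np

def encode_angles(incidence_deg, azimuth_deg, dim=8):
    """Map two acquisition angles to a sinusoidal vector of length `dim`.

    Assumption: integer frequencies (1, 2, 4, ...) are used so that the
    encoding is periodic in the azimuth angle (0 and 360 degrees coincide).
    """
    assert dim % 4 == 0, "dim must be divisible by 4 (2 angles x sin/cos)"
    angles = np.deg2rad([incidence_deg, azimuth_deg])
    freqs = 2.0 ** np.arange(dim // 4)
    phases = np.outer(angles, freqs).ravel()  # length dim / 2
    return np.concatenate([np.sin(phases), np.cos(phases)])

def add_ape(tokens, incidence_deg, azimuth_deg):
    """Add the same acquisition encoding to every token of one view."""
    _, d = tokens.shape
    return tokens + encode_angles(incidence_deg, azimuth_deg, dim=d)

# Two views of the same scene, each with its own acquisition parameters.
view_a = add_ape(np.zeros((4, 8)), incidence_deg=35.0, azimuth_deg=190.0)
view_b = add_ape(np.zeros((4, 8)), incidence_deg=50.0, azimuth_deg=350.0)
fused = np.concatenate([view_a, view_b], axis=0)  # joint token sequence
```

Because each view receives a distinct encoding before the token sequences are concatenated, a transformer operating on `fused` can in principle tell the views apart by their geometry rather than by their order in the sequence.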

Paper Structure

This paper contains 12 sections, 14 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Schematic illustration of the SAR imaging acquisition geometry and modes. Top left: Plane along the flight direction, showing the viewing angle $\theta$ for both descending and ascending orbits. Top right: Azimuth angle $Az$, measured relative to true north. Bottom left: Acquisition modes for a SAR sensor, each accompanied by a visual example demonstrating data captured over the Eiffel Tower. Bottom right: Example output of one of the downstream tasks (height reconstruction) using the proposed SARFormer architecture.
  • Figure 2: The proposed architecture modifications and the corresponding nomenclature for the pre-training scenario (top) and the fine-tuning stage (bottom), which is carried out in a multi-task manner. The two scenarios drawn here correspond to a two-view case but can be generalized to one or multiple views, as done in the experimental section of this work.
  • Figure 3: Comparison between the outputs of different models (see labels) next to the input image, an aerial view, and the ground truth. Building shapes in particular seem to benefit from pre-training. More visual examples can be found in the supplementary materials.
  • Figure 4: Comparison between the four different imaging modes being used in this work. The images depict the same scene captured with similar viewing angles but in different imaging modes.
  • Figure 5: Comparison of the different SAR image geometries. Each row shows a SAR image of the same area taken from a different direction, along with the corresponding height above ground values. Columns 1 and 2 display the images in their native slant-range geometry, where the image columns represent the distance to the sensor from left to right. In Columns 3 and 4, the images are projected onto a terrain model, making each pixel correspond to one meter on the Earth's surface. The far-right column shows the height values in a map projection, independent of the image geometries, and thus identical for both acquisitions.
  • ...and 5 more figures