SARFormer -- An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data
Jonathan Prexl, Michael Recla, Michael Schmitt
TL;DR
SARFormer addresses the complex acquisition geometry of SAR imagery by introducing an Acquisition Parameter Encoding (APE) that conditions a Vision Transformer on per-view acquisition parameters. It couples a multi-view fusion strategy with MAE-based self-supervised pre-training, using three masking schemes to learn robust representations for downstream tasks such as height reconstruction (in map and image geometry) and building-footprint segmentation. In experiments on TerraSAR-X data, SARFormer with APE and domain-adapted pre-training outperforms CNN- and ViT-based baselines, including under limited-label conditions, which suggests its potential as a SAR-focused foundation model. The work demonstrates the value of sensor-aware, geometry-conscious architectures for remote sensing and points to applicability across SAR missions and multi-task objectives.
Abstract
This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed to process one or multiple synthetic aperture radar (SAR) images. Because SAR data has a complex image geometry, we propose an acquisition parameter encoding module that significantly guides the learning process, especially when multiple images are used, and leads to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and benchmark our contributions and adaptations thoroughly in ablation experiments against a baseline, testing the model on tasks such as height reconstruction and segmentation. Our approach achieves up to 17% improvement in RMSE over baseline models.
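To make the idea of conditioning on acquisition parameters concrete, the following is a minimal sketch and not the authors' implementation. It assumes each view is described by a small vector of scalar parameters (here, hypothetically, incidence angle and heading angle in radians), encodes each scalar with fixed sinusoids, and adds the resulting code to every patch token of that view. The function names, the choice of sinusoidal encoding, and the parameter set are all assumptions made for illustration.

```python
import numpy as np


def sinusoidal_encoding(value, dim, max_period=10000.0):
    """Encode one scalar as a vector of `dim` sine and cosine features."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def add_acquisition_encoding(tokens, params):
    """Add a per-view parameter code to all patch tokens of that view.

    tokens: array of shape (num_views, num_patches, dim)
    params: array of shape (num_views, num_params), one row per view
    (Hypothetical interface, chosen only for this sketch.)
    """
    num_views, _, dim = tokens.shape
    num_params = params.shape[1]
    if dim % (2 * num_params) != 0:
        raise ValueError("dim must be divisible by 2 * num_params")
    per_param = dim // num_params
    out = tokens.astype(float)  # astype returns a copy, so the input is untouched
    for v in range(num_views):
        code = np.concatenate(
            [sinusoidal_encoding(p, per_param) for p in params[v]]
        )
        out[v] += code  # broadcast the same code over all patches of view v
    return out


# Two views, four patches each, 64-dimensional tokens (illustrative values).
tokens = np.zeros((2, 4, 64))
params = np.deg2rad(np.array([[35.0, 190.0], [45.0, 350.0]]))
encoded = add_acquisition_encoding(tokens, params)
```

In this sketch every patch of a view receives the same offset, so the transformer can tell views apart by their acquisition geometry while the usual positional encoding still distinguishes patches within a view. A learned projection of the parameters would be an equally plausible design.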
