SAR Object Detection with Self-Supervised Pretraining and Curriculum-Aware Sampling

Yasin Almalioglu, Andrzej Kucik, Geoffrey French, Dafni Antotsiou, Alexander Adam, Cedric Archambeau

TL;DR

The paper addresses the challenge of detecting small objects in satellite SAR imagery under limited annotations. Its core method, TRANSAR, combines self-supervised masked image modeling on unlabeled SAR data with an auxiliary binary semantic segmentation task and an adaptive sampling scheduler to address extreme class imbalance. During pretraining, a CNN-based pixel-shuffling head learns SAR representations, while fine-tuning uses a Gaussian-blob heatmap target for object centers and a loss combining BCE and Dice terms: $L = \alpha \mathrm{BCE}(y, \hat{y}) + \beta \mathrm{Dice}(y, \hat{y})$. Experiments on Capella Space datasets show state-of-the-art mean average precision ($mAP$) and robust F1, with ablations confirming the importance of SSL pretraining and the adaptive sampling strategy for handling highly imbalanced, tiny-object detection in SAR.
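
The fine-tuning target and loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian width `sigma`, the Dice smoothing constant `smooth`, and the default weights `alpha` and `beta` are hypothetical, and the Dice term is taken to be the common soft form $1 - (2\sum y\hat{y} + s)/(\sum y + \sum \hat{y} + s)$.

```python
import math

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Render one Gaussian blob per object center (row, col).

    Overlapping blobs are merged with a pixel-wise maximum (an assumption).
    """
    heat = [[0.0] * width for _ in range(height)]
    for cy, cx in centers:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
                heat[y][x] = max(heat[y][x], g)
    return heat

def bce(target, pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels; soft targets are allowed."""
    total, n = 0.0, 0
    for t_row, p_row in zip(target, pred):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            n += 1
    return total / n

def dice_loss(target, pred, smooth=1.0):
    """Soft Dice loss: 1 - (2*intersection + s) / (sum(target) + sum(pred) + s)."""
    inter = sum(t * p for tr, pr in zip(target, pred) for t, p in zip(tr, pr))
    denom = sum(t + p for tr, pr in zip(target, pred) for t, p in zip(tr, pr))
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def combined_loss(target, pred, alpha=1.0, beta=1.0):
    """L = alpha * BCE(y, y_hat) + beta * Dice(y, y_hat)."""
    return alpha * bce(target, pred) + beta * dice_loss(target, pred)

# Usage: a 9x9 target with one object center, scored against two predictions.
target = gaussian_heatmap(9, 9, [(4, 4)])
empty = [[0.0] * 9 for _ in range(9)]
print(combined_loss(target, target), combined_loss(target, empty))
```

The Dice term normalises by the total foreground mass, so it is less dominated by the many background pixels than BCE alone, which is a common motivation for combining the two when objects are tiny.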

Abstract

Object detection in satellite-borne Synthetic Aperture Radar (SAR) imagery holds immense potential in tasks such as urban monitoring and disaster response. However, the inherent complexities of SAR data and the scarcity of annotations present significant challenges to the advancement of object detection in this domain. Notably, the detection of small objects in satellite-borne SAR images poses a particularly intricate problem because of the technology's relatively low spatial resolution and inherent noise. Furthermore, the lack of large labelled SAR datasets hinders the development of supervised deep learning-based object detection models. In this paper, we introduce TRANSAR, a novel self-supervised end-to-end vision transformer-based SAR object detection model that incorporates masked image pre-training on an unlabeled SAR image dataset covering more than $25{,}700$ km$^2$ of ground area. Unlike traditional object detection formulations, our approach capitalises on auxiliary binary semantic segmentation, designed to separate objects of interest, especially the smaller ones, from the background during the post-tuning. In addition, to address the innate class imbalance due to the disproportion of the object to the image size, we introduce an adaptive sampling scheduler that dynamically adjusts the target class distribution during training based on curriculum learning and model feedback. This approach allows us to outperform conventional supervised architectures such as DeepLabv3 or UNet, and state-of-the-art self-supervised learning-based architectures such as DPT, SegFormer or UperNet, as shown by extensive evaluations on benchmark SAR datasets.
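
One way to picture the adaptive sampling scheduler is as a foreground sampling probability that follows a curriculum over epochs and is lowered further when validation precision stops improving. The sketch below is hypothetical: the class name `AdaptiveSampler`, the linear decay, the `patience` and `step` parameters, and all default values are assumptions made for illustration, and the abstract does not specify the actual update rule.

```python
import random

class AdaptiveSampler:
    """Hypothetical scheduler: curriculum decay plus precision feedback."""

    def __init__(self, fg_chips, bg_chips, total_epochs=100, p_start=0.9,
                 p_end=0.5, p_min=0.1, patience=3, step=0.1, seed=0):
        self.fg_chips, self.bg_chips = list(fg_chips), list(bg_chips)
        self.total_epochs = total_epochs
        self.p_start, self.p_end, self.p_min = p_start, p_end, p_min
        self.patience, self.step = patience, step
        self.p_fg = p_start                    # current foreground probability
        self.best_precision = float("-inf")
        self.stalled = 0                       # epochs without improvement
        self.rng = random.Random(seed)

    def update(self, epoch, precision):
        """Recompute the foreground probability after an evaluation pass."""
        # Curriculum term: linear decay from p_start to p_end over training.
        frac = min(epoch / max(self.total_epochs - 1, 1), 1.0)
        base = self.p_start + (self.p_end - self.p_start) * frac
        # Feedback term: count epochs in which precision did not improve.
        if precision > self.best_precision:
            self.best_precision = precision
            self.stalled = 0
        else:
            self.stalled += 1
        # Every `patience` stalled epochs shifts sampling toward background.
        penalty = self.step * (self.stalled // self.patience)
        self.p_fg = max(self.p_min, base - penalty)
        return self.p_fg

    def sample_batch(self, batch_size):
        """Draw a batch, picking foreground chips with probability p_fg."""
        batch = []
        for _ in range(batch_size):
            pool = self.fg_chips if self.rng.random() < self.p_fg else self.bg_chips
            batch.append(self.rng.choice(pool))
        return batch

# Usage: placeholder chip identifiers stand in for image crops.
sampler = AdaptiveSampler(["fg_0", "fg_1"], ["bg_0", "bg_1", "bg_2"])
sampler.update(epoch=0, precision=0.2)
print(sampler.sample_batch(4))
```

Drawing more background chips once precision stalls exposes the model to more hard negatives, which is the behaviour the abstract attributes to model feedback; the exact mechanism in TRANSAR may differ.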

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The proposed SSL SAR object detection pipeline with adaptive sampling. Adaptive sampling balances foreground and background in each batch, guided by prediction performance. The vision transformer processes image patches, embedding them with positional encoding. Lightweight prediction heads handle reconstruction during pretraining and object map predictions during fine-tuning.
  • Figure 2: Example qualitative SAR object detection results. a. Fine-grained detection of small objects. The supervised model fails to distinguish the densely clustered target objects. b. Robustness to false reflective objects. The supervised model generates false positives. c. Precise detection. The supervised model produces a mix of false and true predictions. d. Similar performance in rural areas, where reflective objects are distinct from the background.
  • Figure 3: Example normalised sampling distribution of foreground (positive) and background (negative) samples under different precision trends. The sampler frequently draws foreground samples in the early epochs and switches to background samples to improve the precision. a. Constantly improving precision. b. Precision stalls after epoch 100 and the distribution shifts towards heavy negative sampling.
  • Figure 4: Example detection results in urban areas. Highly reflective objects pose a significant challenge for the models. Both TRANSAR and the supervised approaches generate false predictions, as shown in the chips.
  • Figure 5: Sensitivity analysis of the auxiliary segmentation task in terms of precision-recall curves. a. NMS distance. b. Confidence threshold. c. Hit distance.
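
The three parameters varied in Figure 5 can be illustrated with a small post-processing sketch. The interpretation here is an assumption, since the captions do not define the terms: the confidence threshold discards low-scoring peaks, the NMS distance suppresses weaker peaks that lie too close to a stronger one, and the hit distance is the largest offset at which a detection still counts as matching a ground-truth center. The function names and default values are hypothetical.

```python
import math

def point_nms(candidates, conf_threshold=0.5, nms_distance=3.0):
    """Greedy point-based NMS over (row, col, score) candidates."""
    kept = []
    for y, x, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score < conf_threshold:
            break  # remaining candidates score even lower
        if all(math.hypot(y - ky, x - kx) > nms_distance for ky, kx, _ in kept):
            kept.append((y, x, score))
    return kept

def precision_recall(detections, ground_truth, hit_distance=5.0):
    """Match each detection to the nearest unmatched ground-truth center."""
    unmatched = list(ground_truth)
    tp = 0
    for y, x, _ in detections:
        best, best_d = None, hit_distance
        for gt in unmatched:
            d = math.hypot(y - gt[0], x - gt[1])
            if d <= best_d:
                best, best_d = gt, d
        if best is not None:
            unmatched.remove(best)  # each ground-truth point matches once
            tp += 1
    fp = len(detections) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall

# Usage: two nearby peaks collapse to one, and a low-score peak is dropped.
peaks = [(10, 10, 0.9), (11, 10, 0.8), (30, 30, 0.7), (50, 50, 0.2)]
kept = point_nms(peaks)
print(kept, precision_recall(kept, [(10, 11), (40, 40)]))
```

Sweeping one of the three parameters while holding the other two fixed and recording the resulting precision and recall pairs would trace out curves of the kind the figure describes.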