Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders

Xinyang Pu, Feng Xu

TL;DR

This work tackles the domain gap between natural-image pre-training and SAR tasks by introducing self-supervised pre-training on SAR data using Masked Auto-Encoders (MAE). The MAE pre-trains a Vision Transformer encoder on masked SAR patches, after which the encoder serves as the backbone in a ViTDet-based detector with FPN and RoI head for SAR object detection. On the SARDet-100k benchmark, SAR-domain SSL pre-training outperforms both ImageNet pre-training and training from scratch, achieving a 1.3 mAP improvement over SFT alone and demonstrating improved generalization to downstream tasks. The results support the practicality of domain-specific, SSL-based backbone pre-training for SAR imagery, reducing dependence on natural-image priors and enhancing performance on large-scale SAR detection tasks.

Abstract

Supervised fine-tuning (SFT) methods are highly efficient for AI-based interpretation of SAR images because they leverage the representation knowledge of pre-trained models. Because domain-specific pre-trained backbones for SAR images are lacking, the conventional strategy is to load foundation models pre-trained on natural scenes, such as ImageNet, whose image characteristics differ greatly from those of SAR images. This may hinder model performance on downstream tasks when SFT is applied to small-scale annotated SAR data. In this paper, a self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during pre-training and to benefit SAR object detection in the subsequent SFT stage. Evaluation experiments on the large-scale SAR object detection benchmark SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves model generalization in downstream tasks by shifting the pre-training domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared with SFT-only strategies.
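
To make the masked image modeling step more concrete, the following is a minimal sketch of MAE-style random patch masking on a toy single-channel image. It is not the authors' implementation: the function names `patchify` and `random_mask`, the 8x8 toy image, the patch size of 4, and the mask ratio of 0.75 (the default in the original MAE work; the ratio used for SAR pre-training is not stated here) are all assumptions for illustration.

```python
import random

def patchify(image, patch):
    """Split a 2-D image (list of rows) into non-overlapping patch x patch tiles, in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles

def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Randomly choose which patch indices are hidden from the encoder.

    mask_ratio=0.75 is an assumption taken from the original MAE setup,
    not a value confirmed for the SAR experiments.
    """
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked

# Toy 8x8 image standing in for a SAR amplitude chip (hypothetical data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = patchify(image, patch=4)  # four 4x4 patches
visible_ids, masked_ids = random_mask(len(tiles), mask_ratio=0.75)

# Only the visible patches would be passed to the ViT encoder; the decoder
# is then asked to reconstruct the masked ones.
encoder_input = [tiles[i] for i in visible_ids]
```

In this sketch the encoder sees only one of the four patches, which mirrors the idea that the backbone must learn SAR image structure from a small visible subset rather than from labels.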

Paper Structure (15 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: The architecture of the proposed method, including self-supervised pre-training and supervised fine-tuning.
  • Figure 2: Reconstruction results of the MAE model on the validation set. Row 1: Masked images. Row 2: Reconstructed images. Row 3: Ground truth.
  • Figure 3: Detection results of the ViTDet pre-trained on SARDet-100k. Row 1: Ground truth. Row 2: Predicted result. Boxes in different colors indicate different categories.
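
The reconstructions described for Figure 2 are the output of a decoder trained to predict the hidden patches. As a rough sketch, the original MAE formulation scores reconstructions with mean squared error computed over masked patches only; whether this paper uses exactly that loss is an assumption here. The function name `masked_mse` and the toy values are hypothetical.

```python
def masked_mse(pred, target, masked_ids):
    """Mean squared error averaged over the pixels of masked patches only.

    pred and target are lists of flattened patches; visible patches are
    ignored, following the original MAE loss (assumed, not confirmed,
    for the SAR setting).
    """
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

# Three toy patches of two pixels each; patch 0 is visible, patches 1 and 2 are masked.
target = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
pred = [[9.0, 9.0], [2.0, 5.0], [4.0, 4.0]]

# The large error on visible patch 0 does not contribute to the loss.
loss = masked_mse(pred, target, masked_ids=[1, 2])
```

Restricting the loss to masked patches means the model is rewarded only for inferring content it could not see, which is what pushes the encoder toward useful representations.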