Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders
Xinyang Pu, Feng Xu
TL;DR
This work tackles the domain gap between natural-image pre-training and SAR tasks by introducing self-supervised pre-training on SAR data using Masked Auto-Encoders (MAE). The MAE pre-trains a Vision Transformer encoder on masked SAR patches, after which the encoder serves as the backbone in a ViTDet-based detector with FPN and RoI head for SAR object detection. On the SARDet-100k benchmark, the SAR-domain SSL pre-training outperforms both ImageNet pre-training and training from scratch, achieving a 1.3 mAP improvement over SFT alone and demonstrating improved generalization to downstream tasks. The results support the practicality of domain-specific, SSL-based backbone pre-training for SAR imagery, reducing dependence on natural-image priors and enhancing performance on large-scale SAR detection tasks.
Abstract
Supervised fine-tuning (SFT) methods perform very well in artificial-intelligence interpretation of SAR images because they leverage the powerful representations learned by pre-trained models. Because domain-specific pre-trained backbones for SAR images are lacking, the conventional strategy is to load foundation models pre-trained on natural-scene datasets such as ImageNet, whose image characteristics differ greatly from those of SAR images. This may hinder model performance on downstream tasks when SFT is applied to small-scale annotated SAR data. In this paper, a self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during pre-training and thereby benefit SFT for object detection in SAR images. Evaluation experiments on the large-scale SAR object detection benchmark SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves model generalization in downstream tasks by shifting the pre-training domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared with SFT-only strategies.
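
The masked image modeling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`patchify`, `random_masking`, `masked_mse`), the single-channel 224x224 input, the 16x16 patch size, and the 0.75 mask ratio are all assumptions chosen for illustration (the ratio follows the default commonly used for MAE on natural images, not a value stated here). The sketch shows only the data flow: split an image into patches, hide a random subset, and score a reconstruction on the hidden patches alone.

```python
import numpy as np


def patchify(image, patch_size):
    """Split a (H, W) single-channel image into flattened, non-overlapping patches."""
    h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size)
    # Reorder to (grid_row, grid_col, row_in_patch, col_in_patch), then flatten.
    return patches.transpose(0, 2, 1, 3).reshape(gh * gw, patch_size * patch_size)


def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches; return them with a boolean mask.

    mask[i] is True where patch i is hidden from the encoder.
    """
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx


def masked_mse(pred, target, mask):
    """Mean squared reconstruction error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()


# Hypothetical stand-in for a SAR amplitude image (random values, not real data).
rng = np.random.default_rng(0)
image = rng.random((224, 224)).astype(np.float32)

patches = patchify(image, patch_size=16)          # 14 x 14 grid of patches
visible, mask, keep_idx = random_masking(patches, mask_ratio=0.75, rng=rng)

# An encoder would see only `visible`; a decoder would predict every patch.
# A perfect reconstruction gives zero loss on the masked patches.
perfect_loss = masked_mse(patches, patches, mask)
```

In an actual pre-training run, a Vision Transformer encoder would process only `visible`, a lightweight decoder would predict all patches, and `masked_mse` would serve as the training loss; the pre-trained encoder would then be reused as the detector backbone.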
