SARATR-X: Toward Building A Foundation Model for SAR Target Recognition

Weijie Li; Wei Yang; Yuenan Hou; Li Liu; Yongxiang Liu; Xiang Li

TL;DR

This work addresses the lack of foundation models for SAR ATR by introducing SARATR-X, a self-supervised foundation model trained on a large unlabeled SAR corpus to enable scalable, label-efficient adaptation across SAR target recognition tasks. A diverse pre-training dataset (SARDet-180K) aggregates 186,600 SAR target samples from 14 open-source datasets, spanning multiple targets, scenes, and sensors, to support broad SAR generalization. SARATR-X uses a SAR-tailored HiViT backbone and a two-step pre-training pipeline (SSL-ImageNet initialization followed by SAR-focused masked image modeling with multi-scale gradient features) to mitigate speckle noise and preserve small-target information. Evaluations demonstrate strong performance on few-shot classification, robustness across operating conditions, and multi-dataset detection, often rivaling or surpassing existing supervised, semi-supervised, or self-supervised methods, with results and code released publicly to spur further SAR foundation-model research.

Abstract

Despite the remarkable progress in synthetic aperture radar automatic target recognition (SAR ATR), recent efforts have concentrated on detecting and classifying a specific category, e.g., vehicles, ships, airplanes, or buildings. One of the fundamental limitations of the top-performing SAR ATR methods is that the learning paradigm is supervised, task-specific, limited-category, and closed-world: it depends on massive amounts of accurately annotated samples, which are expensive to obtain from expert SAR analysts, and it offers limited generalization capability and scalability. In this work, we make the first attempt towards building a foundation model for SAR ATR, termed SARATR-X. SARATR-X learns generalizable representations via self-supervised learning (SSL) and provides a cornerstone for label-efficient model adaptation to generic SAR target detection and classification tasks. Specifically, SARATR-X is trained on 0.18 M unlabelled SAR target samples, which are curated by combining contemporary benchmarks and constitute the largest publicly available dataset to date. Considering the characteristics of SAR images, a backbone tailored for SAR ATR is carefully designed, and a two-step SSL method endowed with multi-scale gradient features is applied to ensure the feature diversity and model scalability of SARATR-X. The capabilities of SARATR-X are evaluated on classification under few-shot and robustness settings and on detection across various categories and scenes, and impressive performance is achieved, often competitive with or even superior to prior fully supervised, semi-supervised, or self-supervised algorithms. Our SARATR-X and the curated dataset are released at https://github.com/waterdisappear/SARATR-X to foster research into foundation models for SAR image interpretation.

Paper Structure (24 sections, 2 equations, 10 figures, 12 tables)

This paper contains 24 sections, 2 equations, 10 figures, 12 tables.

Introduction
Related Work
Foundation models for remote sensing
Related SSL in SAR
Approach
Pre-training Dataset
Model Architecture
Proposed Pre-training Method
Evaluation with Recognition Tasks
SARATR-X Experiments
Comparison of Model Backbones
Strategy of two-step pre-training
Design of target signals for SAR images
Analysis
Leveraging SARATR-X for Recognition
...and 9 more sections

Figures (10)

Figure 1: Various specialized SAR ATR datasets and tasks. SAR ATR includes various imaging conditions (i.e., operating conditions), such as targets, scenes, and sensors. However, the datasets are often collected in specific settings for certain tasks due to high costs. For example, MSTAR [MSTAR] is a ten-type vehicle target classification dataset in the X-band and grass scenarios, and SAR-AIRcraft is a seven-type aircraft detection dataset collected from three airports and a C-band satellite. Specialized algorithms have been proposed for these datasets. However, the differing target characteristics, scene information, and sensor parameters have complicated the generalization of existing algorithms. As such, this paper aims to develop a SAR ATR foundation model, a generalized method for conducting various tasks.
Figure 2: Results on classification and detection tasks. SARATR-X performed well across 5 datasets with 8 settings. It was superior to existing SSL methods (BIDFC [zhai2022weakly]) for target classification in the fine-grained vehicle MSTAR dataset [MSTAR] with a few-shot setting. In addition, it performed well under extended operating conditions (EOCs) [zhang2021domain], i.e., imaging conditions with variable depression angle (EOCs-Depression), target configuration (EOCs-Config), and version (EOCs-Version). SARATR-X also demonstrated competitive object detection performance with existing supervised methods applied to various categories (SARDet-100K [li2024sardet100k] and OGSOD [wang2023category]), as well as specific categories for ships (SSDD [zhang2021sar]) and aircraft (SAR-AIRcraft [wang2023sar]). Our study shows the potential of a foundation model for SAR ATR.
Figure 3: Two-step pre-training process. The first step performs MIM on ImageNet data to obtain better initialization weights for model diversity, as shown in Fig. 5(c). The second step performs MIM on SAR images with high-quality guide signals: multi-scale gradient features that suppress speckle noise and extract target edges.
Figure 4: Discussion of single-scale and multi-scale kernel settings for MGF. Here, scales 1/2/3 correspond to $r$ equal to 9/13/17, and the multi-scale setting concatenates all scales. This multi-scale approach is more suitable than a single-scale technique for various targets in remote sensing images.
Figure 5: Averaged attention distances for various attention heads in the SSL models (the x-axis is the attention head w.r.t. layer number, and point colors represent different layers for better visualization). Attention distance represents the range of a receptive field. We focused specifically on model architectures (Fig. (a) vs. Fig. (b)), initialization weights (Fig. (a) vs. Fig. (c)), and SSL signals (Fig. (d) vs. Fig. (e)) to ensure diverse attention ranges for SAR target recognition, including the HiViT architecture, ImageNet weights, and SAR target features.
...and 5 more figures
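
The multi-scale gradient features (MGF) mentioned in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It assumes a ratio-of-averages gradient, a common speckle-robust choice for SAR images; the paper's exact operator may differ. The function names `ratio_gradient` and `multi_scale_gradient`, the reflect padding, and the `eps` constant are hypothetical choices made for this illustration. Only the kernel sizes $r$ = 9/13/17 come from the Figure 4 caption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ratio_gradient(img, r, eps=1e-6):
    """Gradient magnitude from log-ratios of local means in an r x r window.

    Assumption: a ratio-based gradient stands in for the paper's operator.
    Ratios of means cancel multiplicative scaling, which is why this family
    of operators is often used on speckled SAR imagery.
    """
    h = r // 2
    padded = np.pad(img.astype(np.float64), h, mode="reflect")
    win = sliding_window_view(padded, (r, r))  # shape (H, W, r, r)

    # Mean intensity on each side of the center pixel.
    left = win[..., :, :h].mean(axis=(-2, -1))
    right = win[..., :, h + 1:].mean(axis=(-2, -1))
    up = win[..., :h, :].mean(axis=(-2, -1))
    down = win[..., h + 1:, :].mean(axis=(-2, -1))

    gx = np.log((right + eps) / (left + eps))  # horizontal log-ratio
    gy = np.log((down + eps) / (up + eps))     # vertical log-ratio
    return np.hypot(gx, gy)


def multi_scale_gradient(img, scales=(9, 13, 17)):
    """Stack one gradient map per kernel size into a (num_scales, H, W) array."""
    return np.stack([ratio_gradient(img, r) for r in scales], axis=0)


# Usage: a synthetic 32 x 32 image with a vertical step edge at column 16.
image = np.ones((32, 32))
image[:, 16:] = 10.0
features = multi_scale_gradient(image)
print(features.shape)
```

In a masked image modeling setup, a stack like `features` would serve as the reconstruction target for masked patches in place of raw pixel values; that wiring is omitted here.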
