Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture

Weijie Li; Yang Wei; Tianpeng Liu; Yuenan Hou; Yuxuan Li; Zhen Liu; Yongxiang Liu; Li Liu

TL;DR

This work tackles the problem of learning generic SAR ATR representations with self-supervised learning by introducing SAR-JEPA, a joint-embedding predictive architecture that combines local masked patches with a gradient-based target encoder to predict multi-scale SAR gradient features from unseen patches. By pretraining on large unlabeled SAR datasets and evaluating on multiple fine-grained downstream tasks, the authors demonstrate that integrating masked image modeling with physics-informed gradient targets yields superior representations, particularly as data volume grows. The approach addresses SAR-specific challenges such as speckle noise and small targets, and the results indicate potential for a foundation model for SAR ATR across targets, scenes, and sensors. The work suggests promising practical impact for low-label regimes and motivates further scaling with synthetic data and broader architecture exploration.

Abstract

The growing volume of Synthetic Aperture Radar (SAR) data has the potential to support a foundation model through Self-Supervised Learning (SSL) methods, which can address various SAR Automatic Target Recognition (ATR) tasks by pre-training on large-scale unlabeled data and fine-tuning on small labeled samples. SSL aims to construct supervision signals directly from the data, which minimizes the need for expensive expert annotation and maximizes the use of the expanding data pool for a foundation model. This study investigates an effective SSL method for SAR ATR, which can pave the way for a foundation model in SAR ATR. The primary obstacles faced in SSL for SAR ATR are the small targets in remote sensing and the speckle noise in SAR images, which affect the SSL approach and the supervision signals, respectively. To overcome these challenges, we present a novel Joint-Embedding Predictive Architecture for SAR ATR (SAR-JEPA), which leverages local masked patches to predict the multi-scale SAR gradient representations of unseen context. The key aspect of SAR-JEPA is integrating SAR domain features to ensure high-quality self-supervised signals as target features. In addition, we employ local masks and multi-scale features to accommodate the various small targets in remote sensing. By pre-training on four datasets and then fine-tuning and evaluating our framework on three target recognition datasets (vehicle, ship, and aircraft), we demonstrate that it outperforms other SSL methods and remains effective as the amount of SAR data increases. This study showcases the potential of SSL for SAR target recognition across diverse targets, scenes, and sensors. Our code and weights are available at \url{https://github.com/waterdisappear/SAR-JEPA}.

Paper Structure (15 sections, 2 equations, 7 figures, 6 tables)

Introduction
Related Work
Self-supervised learning in computer vision
Self-supervised learning in remote sensing
Approach
Local Masked Patches
Target Encoder
Implementation
Experiments
Dataset and Experimental Settings
Ablation Study
Comparisons with other methods
Visualization
Scaling Experiment
Conclusion

Figures (7)

Figure 1: Comparison of related architectures and the proposed architecture. (a) Generative architecture [he2022masked] focuses on reconstructing the pixels of unseen patches with a high mask proportion. This approach creates a challenging and meaningful pretext task that allows the model to learn contextual relationships within images. (b) Physics-guided contrastive architecture (PGCA) [datcu2023explainable, huang2022physically] leverages the unique representation in the SAR domain as guided signals to constrain the representation of the deep neural network (DNN). By incorporating physical principles into the learning process, PGCA improves the ability of the model to capture accurate and important SAR features. (c) Image-based joint-embedding predictive architecture (I-JEPA) [assran2023self] uses deep features as target signals and learns more semantic information. However, we found that a learnable target encoder is susceptible to feature collapse due to SAR image noise. (d) Our joint-embedding predictive architecture for SAR ATR (SAR-JEPA) combines the SAR domain embedding and the meaningful masked image modeling (MIM) pretext task to learn contextual relationships in the SAR gradient feature space. This approach uses prior knowledge about target scale and features in SAR target recognition to improve the representation.
Figure 2: Overall framework of SAR-JEPA. (a) In the pretraining stage, the joint-embedding predictive architecture for SAR automatic target recognition (SAR-JEPA) uses local masked patches to predict the multi-scale SAR feature representations $f_{\rm SAR}$ of unseen patches. The MIM structure uses the Vision Transformer (ViT) from MAE to extract deep features $f_{\rm Deep}$ of masked patches. Its DNN predictor predicts the SAR features of unseen patches. The target encoder uses the GR method to map SAR images from pixel space to feature space, thus extracting the target shapes and avoiding speckle noise interference in SAR images. Local patches and multi-scale features are designed for multi-scale small targets in remote sensing. (b) For downstream datasets, we fine-tune the DNN encoder with a task layer using labeled training data. The DNN encoder weights are frozen or fine-tuned, depending on the tuning setting. The weights of the other model modules are removed. (c) The trained model is used to predict the test result.
Figure 3: Datasets for the pretraining and downstream tasks. Pretraining contains various targets, scenes, and sensors from MSAR, SAR-Ship-Dataset, SARSim, and SAMPLE. MSAR is the satellite-based dataset of ground and sea targets; SAR-Ship-Dataset is the satellite-based dataset of sea targets; SARSim is the multi-angle simulation dataset of vehicle targets; and SAMPLE is the simulated and measured dataset of vehicle targets. We use MSTAR, FUSAR-Ship, and SAR-ACD datasets to evaluate the performance in recognizing different targets. MSTAR is a fine-grained vehicle dataset; FUSAR-Ship is a sea target dataset; and SAR-ACD is a fine-grained aircraft dataset.
Figure 4: Multi-scale kernel settings for $\rm{GR_{lin}}$. Here, scales 1/2/3/4 have $r$ equal to 5/9/13/17, and the multi-scale setting concatenates all four scales along the feature channel. Multi-scale is more suitable than single scale because remote sensing images contain targets of various sizes.
Figure 5: Training loss and test accuracy curves for one fine-tuning run on MSTAR 10-shot (Table \ref{table_result}). The figure shows that effective SSL decreases the training loss rapidly and converges to a region that generalizes better.
...and 2 more figures
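
The multi-scale target features described in Figures 2 and 4 can be illustrated with a small NumPy sketch. This is not the authors' $\rm{GR_{lin}}$ operator, whose exact kernel is not given in this summary. It assumes a simplified ratio-of-local-means gradient: for a window of size $r$, the log-ratio of the mean intensities in opposite half-windows is taken horizontally and vertically, then combined into a magnitude. The function names `ratio_gradient` and `multi_scale_gradient`, the box-mean windows, the reflect padding, and the `eps` constant are all hypothetical choices made for illustration. Only the kernel sizes 5/9/13/17 and the channel-wise concatenation come from the Figure 4 caption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ratio_gradient(img, r, eps=1e-6):
    """Ratio-based gradient magnitude for one odd window size r.

    Assumption: plain box means over half-windows, not the paper's GR_lin kernel.
    """
    half = r // 2
    padded = np.pad(img.astype(np.float64), half, mode="reflect")
    # One r x r window per pixel: shape (H, W, r, r).
    win = sliding_window_view(padded, (r, r))
    left = win[..., :, :half].mean(axis=(-2, -1))
    right = win[..., :, half + 1:].mean(axis=(-2, -1))
    top = win[..., :half, :].mean(axis=(-2, -1))
    bottom = win[..., half + 1:, :].mean(axis=(-2, -1))
    # Log-ratios of opposite half-window means; eps guards against division by zero.
    g_h = np.log((right + eps) / (left + eps))
    g_v = np.log((bottom + eps) / (top + eps))
    return np.sqrt(g_h ** 2 + g_v ** 2)


def multi_scale_gradient(img, kernel_sizes=(5, 9, 13, 17)):
    """Stack one gradient map per scale along a leading feature-channel axis."""
    return np.stack([ratio_gradient(img, r) for r in kernel_sizes], axis=0)


# Usage on a synthetic positive-valued image (stand-in for SAR amplitude).
rng = np.random.default_rng(0)
image = rng.uniform(0.5, 1.5, size=(64, 64))
features = multi_scale_gradient(image)
print(features.shape)  # (4, 64, 64)
```

A ratio of local means, unlike a difference, is unchanged when the whole image is multiplied by a constant, which is one reason ratio operators are commonly preferred for images with multiplicative speckle. Larger windows average over more pixels and respond to larger structures, so stacking the four scales gives the predictor targets that cover several target sizes.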
