FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

TL;DR

This work constructs the first SAR Image-Text-AlphaEarth feature triplet dataset and develops FUSAR-GPT, a VLM built specifically for SAR. FUSAR-GPT introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images.

Abstract

Research on the intelligent interpretation of all-weather, day-and-night Synthetic Aperture Radar (SAR) imagery is crucial for advancing remote sensing applications. Although Visual Language Models (VLMs) have recently demonstrated strong open-world understanding on RGB images, their performance is severely limited when they are applied directly to SAR because of the complexity of the imaging mechanism, the sensitivity to scattering features, and the scarcity of high-quality text corpora. To address this systematically, we constructed the first SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM built specifically for SAR. FUSAR-GPT introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy that decouples knowledge injection from task execution in large models. Together, the spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance on several typical remote sensing vision-language benchmarks, outperforming mainstream baseline models by over 12%.
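
One way to picture the two-stage decoupling is as a schedule of which parameter groups receive gradient updates in each stage. The sketch below is illustrative only: the group names (`backbone`, `lora`, `tlm_mlp`) and the helper `trainable_groups` are hypothetical, and the split follows the Figure 2 caption (Stage-1 updates LoRA and the TLM-MLP, Stage-2 updates only LoRA). Keeping the pretrained backbone frozen in both stages is an assumption based on common LoRA practice, not something this summary states.

```python
# Hypothetical parameter groups; the names are illustrative, not from the paper's code.
PARAM_GROUPS = ("backbone", "lora", "tlm_mlp")

# Which groups are updated in each stage, following the Figure 2 caption.
# Assumption: the pretrained backbone stays frozen in both stages (usual with LoRA).
STAGE_TRAINABLE = {
    1: {"lora", "tlm_mlp"},  # Stage-1: multimodal prior injection
    2: {"lora"},             # Stage-2: task adaptation
}


def trainable_groups(stage: int) -> dict[str, bool]:
    """Return a {group_name: requires_grad} map for the given training stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage: {stage}")
    active = STAGE_TRAINABLE[stage]
    return {name: name in active for name in PARAM_GROUPS}


if __name__ == "__main__":
    for stage in (1, 2):
        print(stage, trainable_groups(stage))
```

In a real training loop, a map like this would typically be used to set the gradient flag on each parameter before building the optimizer for that stage.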

Paper Structure (18 sections, 19 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: FUSAR-GPT embeds spatiotemporal semantic features to compensate the features of SAR images and uses two-stage decoupled SFT to adapt the model to target interpretation in SAR images.
  • Figure 2: Overview of FUSAR-GPT. The framework adopts a two-stage training strategy: Stage-1 jointly updates LoRA and the TLM-MLP for multimodal prior injection, while Stage-2 fine-tunes only LoRA for task adaptation. The TLM module fuses SAR visual tokens with AEF priors by generating spatially informed modulation parameters, which are applied to the visual tokens to enhance downstream reasoning.
  • Figure 3: AEF–SAR Visual Comparison. AEF embeddings extracted at different times are visualized by mapping channels 1, 16 and 9 of the 64-dimensional feature vector to RGB.
  • Figure 4: Overview of the four downstream tasks used in this work: Target Counting, Spatial Localization, Target Classification, and Target Detection.
  • Figure 5: Performance of different training stages across downstream tasks. a: Target Counting; b: Spatial Localization; c: Target Classification; d: Target Detection.
  • ...and 2 more figures
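
The Figure 2 caption says the TLM module generates modulation parameters from AEF priors and applies them to the SAR visual tokens. The caption does not give the exact form, so the sketch below assumes a FiLM-style per-token scale and shift, which is one common way to implement such modulation. The function name `modulate_tokens`, the two-layer MLP, all shapes, and the random weights are hypothetical, and the AEF features are assumed to be already aligned to one 64-dimensional vector per visual token.

```python
import numpy as np


def modulate_tokens(tokens, aef, w1, b1, w2, b2):
    """FiLM-style modulation (an assumed form, not confirmed by the paper).

    tokens: (N, D) SAR visual tokens
    aef:    (N, 64) AEF embedding aligned to each token (alignment assumed)
    w1, b1, w2, b2: weights of a small two-layer MLP with 2*D outputs
    """
    hidden = np.maximum(aef @ w1 + b1, 0.0)     # ReLU hidden layer, (N, H)
    params = hidden @ w2 + b2                   # modulation parameters, (N, 2*D)
    gamma, beta = np.split(params, 2, axis=-1)  # per-token scale and shift
    # Using (1 + gamma) keeps the mapping at identity when the MLP outputs zeros.
    return tokens * (1.0 + gamma) + beta


rng = np.random.default_rng(0)
N, D, H = 4, 8, 16                       # illustrative sizes only
tokens = rng.normal(size=(N, D))
aef = rng.normal(size=(N, 64))
w1 = rng.normal(size=(64, H)) * 0.1
b1 = np.zeros(H)
w2 = np.zeros((H, 2 * D))                # zero-initialized output layer
b2 = np.zeros(2 * D)
out = modulate_tokens(tokens, aef, w1, b1, w2, b2)
```

With a zero-initialized output layer the modulated tokens equal the input tokens, so the prior starts as a no-op and only takes effect as the MLP is trained. That initialization is a design choice of this sketch, not a detail taken from the paper.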
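
The visualization described in the Figure 3 caption, where three channels of the 64-dimensional AEF vector are mapped to RGB, can be sketched as follows. The helper `aef_to_rgb` is hypothetical, the channel numbers 1, 16 and 9 are treated as 0-based indices (the caption does not say which convention it uses), and the per-channel min-max scaling is an assumption about how values are stretched for display.

```python
import numpy as np


def aef_to_rgb(emb, channels=(1, 16, 9)):
    """Map three channels of an (H, W, 64) embedding to an (H, W, 3) uint8 image.

    Assumptions: channel numbers are 0-based indices, and each selected
    channel is min-max scaled independently to the 0..255 range.
    """
    rgb = emb[..., list(channels)].astype(np.float64)
    lo = rgb.min(axis=(0, 1), keepdims=True)
    hi = rgb.max(axis=(0, 1), keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat channels
    return np.round((rgb - lo) / scale * 255.0).astype(np.uint8)


# Synthetic stand-in for an AEF tile; real embeddings would be loaded from data.
emb = np.random.default_rng(0).uniform(-1.0, 1.0, size=(32, 32, 64))
img = aef_to_rgb(emb)
```

Applying the same function to embeddings extracted at different times gives images that can be compared side by side with the SAR chip, which is the kind of comparison the caption describes.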