SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks

Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang

TL;DR

This work tackles the shortage of large-scale SAR image-text data and the modality gap between optical and SAR imagery. It introduces SAR-Narrator to auto-generate high-quality SAR captions and builds SAR-TEXT with over 136k image-text pairs. A two-stage progressive transfer learning strategy pre-trains on optical remote sensing data and fine-tunes on SAR-TEXT to develop SAR-RS-CLIP and SAR-RS-CoCa, achieving improvements in retrieval and captioning. Additionally, SAR-GPT experiments explore SAR-VQA capabilities, and the authors provide public code and data to facilitate community adoption.

Abstract

Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. As a flexible captioning tool, SAR-Narrator can also be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-TEXT dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.

Paper Structure

This paper contains 26 sections, 12 figures, 9 tables, and 2 algorithms.

Figures (12)

  • Figure 1: An overall comparison between SAR-TEXT and 10 other datasets (OpenSARShip huang2017opensarship, FUSAR-Map shi2021object, SSDD zhang2021sar, MSAR xia2022crtranssar, SADD zhang2022sefepnet, SAR-AIRcraft zhirui2023sar, SIVED lin2023sived, SARDet-100k li2024sardet, BRIGHT chen2025bright, and OpenEarthMap-SAR xia2025openearthmap) across three dimensions: dataset size, resolution, and supported task types. Results show that SAR-TEXT outperforms existing datasets, with broader and more comprehensive application potential in SAR image interpretation.
  • Figure 2: Overview of the SAR-Narrator framework and SAR-TEXT dataset: SAR images are automatically captioned by SAR-Narrator to construct SAR-TEXT, a high-quality dataset containing 130,000+ image-text pairs. Subsequently, two models, SAR-RS-CLIP and SAR-RS-CoCa, are trained from vision-language models (VLMs) with a progressive two-stage fine-tuning strategy to achieve more effective semantic interpretation of SAR images.
  • Figure 3: SAR-TEXT dataset construction method.
  • Figure 4: Example samples from the SAR-TEXT dataset.
  • Figure 5: SAR-TEXT dataset word cloud.
  • ...and 7 more figures