
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation

Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan

TL;DR

This work tackles the scarcity of SAR-domain vision-language resources by introducing SARChat-2M, a ~2 million image-text SAR dataset spanning maritime, terrestrial, and urban scenes with 0.3–10 m resolution, and six task types that promote multi-task SAR interpretation. It couples SARChat-2M with SARChat-Bench, a comprehensive benchmark evaluating six tasks—classification, fine-grained description, counting, grounding, cross-modal identification, and referring—across 16 mainstream VLMs, and analyzes performance gains from SAR-specific fine-tuning and model size. The study demonstrates that domain-adapted SAR VLMs significantly improve SAR interpretation, with edge-side models enabling potential real-time deployment, and provides a general framework for constructing multimodal datasets in other remote-sensing domains. Overall, the paper offers a practical path toward robust, domain-specialized VLMs for SAR imagery, with implications for military, maritime, and infrastructure monitoring while acknowledging dataset annotation limitations and ethical considerations.

Abstract

As a powerful all-weather Earth observation tool, synthetic aperture radar (SAR) remote sensing enables critical military reconnaissance, maritime surveillance, and infrastructure monitoring. Although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset supports several key tasks, such as visual understanding and object detection. Beyond that, this study develops a vision-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation, and provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Experiments on 16 mainstream VLMs verify the effectiveness of the dataset. The project will be released at https://github.com/JimmyMa99/SARChat.

Paper Structure

This paper contains 28 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: An overview of SARChat-Bench-2M. The left panel shows representative tasks realized with SARChat-2M, the SAR image-text dataset constructed in this paper, illustrating the dataset's efficacy in supporting multi-task applications. The right panel presents the correlation radar charts and quantitative line graphs derived from the performance evaluation of 16 VLMs based on this dataset, establishing the benchmark (SARChat-Bench) for this domain.
  • Figure 2: Construction of the SARChat-2M dataset. On the left are ten existing SAR detection benchmark datasets. In the middle is the SARDet-100K dataset, formed by integrating the ten datasets on the left. On the right are the six core tasks constructed from the dataset, each with its own task identifier, operation steps, and templates.
  • Figure 3: Evaluation examples on SARChat-Bench. VLM predictions are shown in green/red for correct/incorrect descriptions, with ground-truth boxes in green and predicted boxes in red. The [Human], [Bot], and [Check] icons denote user input, VLM response, and standard output, respectively.
  • Figure 4: Cloud Map of Word-frequency Distribution
  • Figure 5: The Proportion Distribution of Samples in the Training Set
  • ...and 5 more figures