SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

TL;DR

SARLANG-1M addresses a critical gap in SAR interpretation by providing a large-scale, SAR-focused vision-language benchmark. It combines two tasks, captioning (SARLANG-1M-Cap) and VQA (SARLANG-1M-VQA), whose text annotations are generated via RGB-to-SAR text transfer and bounding-box-grounded annotation, with expert quality control. Fine-tuning mainstream VLMs on SARLANG-1M yields substantial gains, achieving performance comparable to human experts on SAR VQA and improving captioning quality, especially for complex descriptions; SAR image preprocessing further improves results. The dataset and code are publicly available to support open-vocabulary, cross-modal SAR understanding across multi-resolution imagery.

Abstract

Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.

Paper Structure

This paper contains 25 sections, 14 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Exemplar Cases of Mainstream VLMs that Exhibit Strong Performance on RGB Images but Struggle with Paired SAR Images. The first row shows the image captioning results generated by the DeepSeekVL-7B [lu2024deepseek], GPT-4o [achiam2023gpt], and LLaVA1.5-7B [liu2024visual] models in the left, center, and right examples, respectively. The second row presents the VQA results produced by the same models in the same order, further highlighting the challenges of SAR image interpretation. Key terms in the text are underlined in red. For instance, in the right example of the first row, the LLaVA1.5-7B [liu2024visual] model misclassifies a residential scene as a person’s hand. Furthermore, in the right example of the second row, the same model fails to identify the presence of water in the SAR image, despite successfully recognizing it in the corresponding RGB image.
  • Figure 2: Examples from the SARLANG-1M dataset. SARLANG-1M consists of two benchmarks: SARLANG-1M-Cap for SAR image captioning and SARLANG-1M-VQA for SAR image VQA. The SARLANG-1M-Cap benchmark is characterized by four dimensions (caption type, presence of location information, resolution, and number of categories) and supports the task of detailed SAR image description. In contrast, the SARLANG-1M-VQA benchmark is designed to enable six additional applications: object identification, object classification, instance counting, region referring, object positioning, and others.
  • Figure 3: Cities included in the SARLANG-1M dataset. SARLANG-1M contains SAR images from over 59 cities worldwide, most of which are highlighted on the map.
  • Figure 4: Statistics of Text Annotations in the SARLANG-1M benchmark. (a) Distribution of the seven applications provided in the SARLANG-1M benchmark. (b) Number of questions of each type in the 'others' application. (c) Distribution of the 30 most frequent object categories.
  • Figure 5: Pre-processing Pipeline for SAR Images in Our Benchmark. All SAR images in our dataset, except those from the SARDet-100k [li2024sardet] dataset, undergo a standardized preprocessing pipeline: selection of specific polarizations, followed by denoising [yommy2015sar] and contrast enhancement [ai2019outliers] to optimize image quality and interpretability.
  • ...and 4 more figures
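
The Figure 5 caption names three preprocessing stages: polarization selection, denoising, and contrast enhancement. The snippet below is a minimal sketch of that ordering only. It is not the authors' implementation: the cited denoising and enhancement methods are not specified on this page, so a simplified Lee filter and a percentile contrast stretch are swapped in as stand-ins. All function names, the `channels` dictionary layout, and the default parameters are assumptions made for illustration.

```python
from statistics import mean, pvariance


def select_polarization(channels, name="VV"):
    """Pick one polarization channel from a dict of 2-D lists (hypothetical layout)."""
    return channels[name]


def lee_filter(img, radius=1):
    """Simplified Lee speckle filter (stand-in denoiser, not the cited method)."""
    h, w = len(img), len(img[0])
    # Crude assumption: use the global image variance as the noise variance.
    noise_var = pvariance([v for row in img for v in row])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            m, v = mean(window), pvariance(window)
            # Weight near 0 smooths flat regions; near 1 preserves edges.
            weight = v / (v + noise_var) if (v + noise_var) > 0 else 0.0
            out[y][x] = m + weight * (img[y][x] - m)
    return out


def stretch_contrast(img, low_pct=2.0, high_pct=98.0):
    """Percentile contrast stretch to 0..255 (stand-in enhancement step)."""
    flat = sorted(v for row in img for v in row)
    n = len(flat)
    lo = flat[int((n - 1) * low_pct / 100.0)]
    hi = flat[int((n - 1) * high_pct / 100.0)]
    if hi <= lo:  # constant image: nothing to stretch
        return [[0 for _ in row] for row in img]
    return [
        [round(255 * min(1.0, max(0.0, (v - lo) / (hi - lo)))) for v in row]
        for row in img
    ]


def preprocess(channels, polarization="VV"):
    """Apply the three stages in the order given in the Figure 5 caption."""
    img = select_polarization(channels, polarization)
    return stretch_contrast(lee_filter(img))
```

A call such as `preprocess({"VV": vv_pixels, "VH": vh_pixels}, "VV")` returns an 8-bit-range image of the same size as the input. The pure-Python loops keep the sketch dependency-free; an array library would be the natural choice for real SAR tiles.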