SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing

Peihao Wu, Yongxiang Yao, Yi Wan, Wenfei Zhang, Ruipeng Zhao, Jiayuan Li, Yongjun Zhang

TL;DR

SOMA-1M tackles the shortage of large-scale, high-precision, multi-resolution SAR–optical data by delivering over $1.3$ million pixel-aligned image pairs at $0.5$ m, $3$ m, and $10$ m resolutions. A rigorous coarse-to-fine registration framework ensures pixel-level alignment and preserves geolocation metadata, while four benchmarks (image matching, image fusion, SAR-assisted cloud removal, and SAR-to-optical translation) demonstrate the dataset's value across tasks and resolutions. Training on SOMA-0.1M, a $0.1$M-pair subset, consistently improves state-of-the-art performance across baselines and tasks, with particularly strong gains in multimodal matching and translation when data are explicitly aligned. The findings reveal resolution-dependent strengths and weaknesses, supporting a multi-resolution hierarchical design to advance cross-modal remote sensing and the development of foundation models that can operate globally with spatial awareness.

Abstract

Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
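
To give a sense of scale, the ground area covered by one 512 x 512 tile can be worked out from the resolutions quoted above ($0.5$ m, $3$ m, and $10$ m). The following is a minimal sketch, assuming square pixels and treating each nominal resolution as the ground sample distance; the names `TILE_SIZE_PX`, `RESOLUTIONS_M`, and `tile_footprint_m` are illustrative and not taken from the dataset's tooling.

```python
# Side length on the ground of one square tile.
# Assumption: the nominal resolution equals the ground sample distance (m/pixel).
TILE_SIZE_PX = 512
RESOLUTIONS_M = (0.5, 3.0, 10.0)


def tile_footprint_m(gsd_m, tile_size_px=TILE_SIZE_PX):
    """Return the side length in metres of a square tile."""
    return gsd_m * tile_size_px


footprints = {gsd: tile_footprint_m(gsd) for gsd in RESOLUTIONS_M}

for gsd, side in footprints.items():
    print(f"{gsd} m/pixel -> {side:.0f} m x {side:.0f} m per tile")
```

Under these assumptions a tile spans roughly 256 m, 1.5 km, and 5.1 km on a side at the three resolutions, which illustrates why the same 512 x 512 format captures very different scene content at each level.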

Paper Structure (40 sections, 1 equation, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Overview of the SOMA-1M dataset and examples of its multi-task applications. The two leftmost columns display the original SAR and optical input images. The remaining columns illustrate representative results generated by models trained on this dataset: (a) Image Matching; (b) Image Fusion; (c) SAR-Assisted Cloud Removal; and (d) SAR-to-Optical Translation.
  • Figure 2: Global geographic distribution of SOMA-1M sampling points.
  • Figure 3: Flowchart of the automated data annotation pipeline.
  • Figure 4: Visualization of alignment results.
  • Figure 5: Visualization examples of 12 typical land-cover categories in the SOMA-1M dataset. Each group presents a pair of SAR and optical images with strict pixel-level alignment.
  • ...and 7 more figures