SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery

Jian Song, Hongruixuan Chen, Weihao Xuan, Junshi Xia, Naoto Yokoya

TL;DR

This work tackles the challenge of global 3D semantic understanding from single-view high-resolution remote sensing imagery by introducing SynRS3D, the largest synthetic RS 3D dataset (69,667 images spanning six city styles, eight land-cover classes, height maps, and change masks), and RS3DAda, a multi-task unsupervised domain adaptation method designed for synthetic-to-real transfer in land-cover mapping and height estimation. Through a carefully crafted acquisition pipeline, grounding in diverse real-world statistics, and a hybrid self-training framework that leverages land-cover and height cues along with ground-guided refinements, the authors demonstrate that synthetic data can meaningfully bolster real-world RS tasks, especially when real data is scarce. RS3DAda outperforms existing UDA baselines, stabilizes training on synthetic data, and enables disaster mapping via height-difference analysis, establishing SynRS3D as a practical benchmark for future synthetic-to-real RS research. While a gap to real-data performance remains, this work provides a concrete pathway to scalable global RS understanding from monocular imagery, with strong implications for urban planning, environmental monitoring, and disaster response.
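
The disaster-mapping result rests on a simple mechanism: estimate heights from pre- and post-event images with the same model, subtract, and threshold large drops. Below is a minimal sketch of that height-difference analysis, assuming per-pixel height maps in meters; the function name, the 3 m threshold, and the toy arrays are illustrative, not values from the paper.

```python
import numpy as np

def height_change_map(pre_height: np.ndarray,
                      post_height: np.ndarray,
                      drop_threshold: float = 3.0) -> np.ndarray:
    """Flag pixels whose estimated height dropped sharply between two dates.

    A large negative difference (post minus pre) is a crude proxy for a
    collapsed or demolished structure. The 3 m default is an illustrative
    cutoff, not a value taken from the paper.
    """
    diff = post_height - pre_height
    return diff < -drop_threshold

# Toy example: one "building" pixel loses 10 m of height between dates.
pre = np.array([[12.0, 0.0],
                [5.0, 0.0]])
post = np.array([[2.0, 0.0],
                 [5.0, 0.0]])
print(height_change_map(pre, post))
# [[ True False]
#  [False False]]
```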

Abstract

Global semantic 3D understanding from single-view high-resolution remote sensing (RS) imagery is crucial for Earth Observation (EO). However, this task faces significant challenges due to the high costs of annotations and data collection, as well as geographically restricted data availability. To address these challenges, synthetic data offer a promising solution by being easily accessible and thus enabling the provision of large and diverse datasets. We develop a specialized synthetic data generation pipeline for EO and introduce SynRS3D, the largest synthetic RS 3D dataset. SynRS3D comprises 69,667 high-resolution optical images that cover six different city styles worldwide and feature eight land cover types, precise height information, and building change masks. To further enhance its utility, we develop a novel multi-task unsupervised domain adaptation (UDA) method, RS3DAda, coupled with our synthetic dataset, which facilitates the RS-specific transition from synthetic to real scenarios for land cover mapping and height estimation tasks, ultimately enabling global monocular 3D semantic understanding based on synthetic data. Extensive experiments on various real-world datasets demonstrate the adaptability and effectiveness of our synthetic dataset and proposed RS3DAda method. SynRS3D and related codes will be available.
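
For concreteness, the abstract's per-scene products (optical image, eight-class land-cover map, height map, building-change mask) can be pictured as four co-registered rasters. The hypothetical container below sketches that structure as a mental model only; the field names, dtypes, and shapes are assumptions, not the released file format.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class SynRS3DSample:
    """One SynRS3D scene as four co-registered rasters (hypothetical schema)."""
    image: np.ndarray        # (H, W, 3) uint8 optical image
    land_cover: np.ndarray   # (H, W) uint8 class ids, 8 classes in [0, 7]
    height: np.ndarray       # (H, W) float32 height in meters
    change_mask: np.ndarray  # (H, W) bool, True where a building changed

    def __post_init__(self) -> None:
        # All four products must share the same spatial grid.
        h, w = self.image.shape[:2]
        for arr in (self.land_cover, self.height, self.change_mask):
            assert arr.shape[:2] == (h, w), "rasters are not co-registered"

# Toy instantiation with random data on a 4x4 grid.
rng = np.random.default_rng(0)
sample = SynRS3DSample(
    image=rng.integers(0, 256, (4, 4, 3), dtype=np.uint8),
    land_cover=rng.integers(0, 8, (4, 4), dtype=np.uint8),
    height=rng.random((4, 4), dtype=np.float32) * 30.0,
    change_mask=rng.random((4, 4)) > 0.9,
)
```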

Paper Structure

This paper contains 30 sections, 12 equations, 20 figures, 14 tables, and 1 algorithm.

Figures (20)

  • Figure 1: 3D visualization results from real-world monocular RS images, produced by a model trained on the SynRS3D dataset with the proposed RS3DAda method. "SA" indicates South America and "NA" indicates North America.
  • Figure 2: Examples and statistics of SynRS3D.
  • Figure 3: Generation workflow of SynRS3D.
  • Figure 4: Overview of the proposed RS3DAda method. T denotes statistical image translation; S denotes strong augmentation. For Online Student Model Training, dotted line: target image, solid line: source image. For Ground-Guided Pseudo-Label Generation, dotted line: original target image, solid line: strongly augmented target image. (A minimal sketch of this teacher-student setup appears after the figure list.)
  • Figure 5: Results of the RS3DAda height estimation branch using DINOv2 [oquab2023dinov2] and DPT [ranftl2021vision]. 'Whole' denotes evaluation over the entire image; 'High' denotes evaluation over image regions above 3 meters. T.D.1 and T.D.2 correspond to Target Domain 1 and Target Domain 2, respectively, as specified in the paper's real-world datasets table. Avg. stands for the average value.
  • ...and 15 more figures
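
Figure 4's caption names the moving parts of RS3DAda: statistical image translation T applied to source images, strong augmentation S applied to target images, and an online student trained against teacher-generated pseudo-labels. The sketch below shows one step of a generic EMA-teacher, two-head self-training loop of this kind; the TwoHeadNet toy model, the 0.9 confidence cutoff, the L1 height loss, the EMA rate, and the assumption of color-only strong augmentation (so pseudo-labels stay pixel-aligned) are all illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    """Shared encoder with a land-cover head and a height head (toy stand-in)."""
    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)
        self.seg_head = nn.Conv2d(16, num_classes, 1)  # land-cover logits
        self.height_head = nn.Conv2d(16, 1, 1)         # per-pixel height

    def forward(self, x):
        feats = F.relu(self.encoder(x))
        return self.seg_head(feats), self.height_head(feats)

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.999):
    # Teacher weights track an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1 - alpha)

def self_training_step(student, teacher, opt,
                       src_img, src_seg, src_height, tgt_img, strong_aug):
    # Supervised losses on (translated) synthetic source data.
    seg_logits, height_pred = student(src_img)
    loss = F.cross_entropy(seg_logits, src_seg) \
         + F.l1_loss(height_pred.squeeze(1), src_height)

    # Teacher sees the clean target image and emits pseudo-labels.
    with torch.no_grad():
        t_seg, t_height = teacher(tgt_img)
        pseudo_seg = t_seg.argmax(dim=1)
        conf = t_seg.softmax(dim=1).amax(dim=1)

    # Student sees the strongly augmented target image; only confident
    # pixels contribute (color-only augmentation keeps labels aligned).
    s_seg, s_height = student(strong_aug(tgt_img))
    mask = conf > 0.9
    if mask.any():
        loss = loss + F.cross_entropy(s_seg, pseudo_seg,
                                      reduction="none")[mask].mean()
        loss = loss + F.l1_loss(s_height[mask.unsqueeze(1)],
                                t_height[mask.unsqueeze(1)])

    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    return loss.item()

# Toy usage with random tensors (additive noise stands in for strong aug).
student, teacher = TwoHeadNet(), TwoHeadNet()
teacher.load_state_dict(student.state_dict())
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss = self_training_step(
    student, teacher, opt,
    src_img=torch.rand(2, 3, 32, 32),
    src_seg=torch.randint(0, 8, (2, 32, 32)),
    src_height=torch.rand(2, 32, 32) * 30,
    tgt_img=torch.rand(2, 3, 32, 32),
    strong_aug=lambda x: (x + 0.1 * torch.randn_like(x)).clamp(0, 1),
)
```

The design point this mirrors from the caption is that the teacher only ever sees the clean target image, while the student learns from its strongly augmented counterpart; that asymmetry is what makes the pseudo-labels a stable training signal.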