Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya

TL;DR

Geo3DVQA introduces the first RGB-only benchmark for height-aware 3D geospatial reasoning, addressing the gap between 2D vision-language capabilities and RGB-to-3D inference. By organizing tasks into three tiers across 16 categories and evaluating ten VLMs, the study reveals fundamental limitations in RGB-based 3D reasoning while demonstrating that domain-specific instruction tuning markedly improves performance. The results show substantial gains for SVF and land-cover tasks with fine-tuning, though precise coordinate-level height remains challenging, suggesting a need for architectural innovations. Overall, Geo3DVQA provides a unified framework to benchmark and accelerate accessible 3D geospatial analysis from widely available RGB imagery.

Abstract

Three-dimensional geospatial analysis is critical for applications in urban planning, climate adaptation, and environmental assessment. However, current methodologies depend on costly, specialized sensors, such as LiDAR and multispectral sensors, which restrict global accessibility. Additionally, existing sensor-based and rule-driven methods struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We present Geo3DVQA, a comprehensive benchmark that evaluates vision-language models (VLMs) in height-aware 3D geospatial reasoning from RGB imagery alone. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios integrating elevation, sky view factors, and land cover patterns. The benchmark comprises 110k curated question-answer pairs across 16 task categories, including single-feature inference, multi-feature reasoning, and application-level analysis. Through a systematic evaluation of ten state-of-the-art VLMs, we reveal fundamental limitations in RGB-to-3D spatial reasoning. Our results further show that domain-specific instruction tuning consistently enhances model performance across all task categories, including height-aware and open-ended, application-oriented reasoning. Geo3DVQA provides a unified, interpretable framework for evaluating RGB-based 3D geospatial reasoning and identifies key challenges and opportunities for scalable 3D spatial analysis. The code and data are available at https://github.com/mm1129/Geo3DVQA.

Paper Structure

This paper contains 50 sections, 8 figures, 20 tables.

Figures (8)

  • Figure 1: Geo3DVQA Task Taxonomy and Evaluation Framework. Our benchmark comprises 16 task categories organized into three complexity tiers to evaluate RGB-to-3D spatial reasoning capabilities. Tier 1 (Single-feature): Direct inference of individual properties across SVF, land cover, and height from RGB patterns. Tier 2 (Multi-feature): Composite reasoning tasks requiring the integration of multiple spatial attributes (e.g., sky visibility combines SVF with building penalties). Tier 3 (Application-level): Free-form reasoning for real-world applications in urban planning, renewable energy, landscape analysis, and water accumulation. Ground truth generation uses multimodal reference data (SVF, DSM, land cover), whereas the models are evaluated using only RGB imagery. The colored regions indicate the locations of multiple-choice options, with answers highlighted for visualization.
  • Figure 2: Geo3DVQA Dataset Construction and Evaluation Pipeline. (1) Dataset Creation: Question templates are generated for each category, with ground-truth answers automatically derived from multimodal data (SVF, DSM, land cover) and their statistics. (2) Fine-tuning: Qwen2.5-VL learns height-aware spatial reasoning from RGB inputs using the constructed dataset. (3) Evaluation: Models are tested using only RGB inputs, with predictions compared against the ground truth. The fine-tuned model leverages domain knowledge from step (2).
  • Figure 3: Task composition of the Geo3DVQA evaluation set (left) and subcategory-level distribution (right), covering 12 task categories. Tier 3 is excluded (116 questions in total).
  • Figure 4: Shortened examples of Tier-3 free-form Q&A, showing the RGB input and the reference data used for ground truth. Statistics are first calculated from the reference modalities (SVF, DSM, segmentation) and passed to GPT-4.1-mini to generate answers corresponding to predefined categories. These answers are validated by cross-checking them against the reference statistics using GPT-5. Q&As that pass validation are then verified by humans, and the final Q&A pairs are used for evaluation.
  • Figure 5: Word cloud of ground-truth answers for Tier-3 free-form Q&As.
  • ...and 3 more figures
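
The dataset-creation step in the Figure 2 caption (ground-truth answers derived automatically from reference data such as SVF) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the question wording, the box format, and the toy raster below are all hypothetical, and the sketch assumes a Tier-1 style multiple-choice question whose answer is the candidate region with the highest mean SVF.

```python
from statistics import mean


def region_mean(raster, box):
    """Mean raster value inside box = (row0, col0, row1, col1), end-exclusive.

    Hypothetical helper: the box format is an assumption for illustration.
    """
    r0, c0, r1, c1 = box
    return mean(raster[r][c] for r in range(r0, r1) for c in range(c0, c1))


def make_highest_svf_question(svf, regions):
    """Build one multiple-choice Q&A pair from a reference SVF raster.

    `regions` maps an option label to a box. The ground-truth answer is the
    label whose box has the highest mean SVF. The model under test would see
    only the RGB image and the question, never the SVF raster itself.
    """
    means = {label: region_mean(svf, box) for label, box in regions.items()}
    answer = max(means, key=means.get)
    return {
        "question": "Which region has the highest average sky view factor?",
        "options": sorted(regions),
        "answer": answer,
    }


# Toy 4x4 SVF raster (values in [0, 1]); real reference data would be far larger.
svf = [
    [0.9, 0.9, 0.2, 0.2],
    [0.9, 0.9, 0.2, 0.2],
    [0.5, 0.5, 0.4, 0.4],
    [0.5, 0.5, 0.4, 0.4],
]
regions = {
    "A": (0, 0, 2, 2),
    "B": (0, 2, 2, 4),
    "C": (2, 0, 4, 2),
    "D": (2, 2, 4, 4),
}

qa = make_highest_svf_question(svf, regions)
print(qa["answer"])
```

The design point this mirrors from the captions is the separation of inputs: reference modalities are used only to compute the answer, while evaluation compares a model's RGB-only prediction against that stored answer.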