GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data

Roger Ferrod, Maël Lecene, Krishna Sapkota, George Leifman, Vered Silverman, Genady Beryozkin, Sylvain Lobry

Abstract

Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring, and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To address these deficiencies, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively closes this performance gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.

Paper Structure (28 sections, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Overview of our dataset. We release both the raw vector data and a derived instruction-tuning dataset. The cadastral vector layers (e.g., buildings, vegetation, roads) serve as the ground truth and are automatically transformed into instructions for downstream tasks including scene captioning, object detection, semantic segmentation, localized classification, Visual Question Answering (VQA) and Referring Expression Comprehension (REC).
  • Figure 2: The data covers 135 unique semantic categories: a) beyond the top-10, common classes include Canals (44k), Railroads (40k), Parking lots (16k), Tennis courts (15k), Cemeteries (13k) and Stadiums (10k). In terms of geometry, b) the dataset comprises 3.8M entities including linear features (e.g., roads, railroads, canals), bounding boxes (derived from original GIS polygons) and segmentation masks.
  • Figure 3: Qualitative samples from the proposed dataset, illustrating the semantic granularity and geometric precision of the cadastral annotations across diverse environments.
  • Figure 4: Statistical analysis of the dataset showing a) the density of objects per image and b) the distribution of vertex counts across the geometric shapes.
  • Figure 5: Selected administrative departments across the French territory.
  • ...and 5 more figures
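
The Figure 2 caption mentions bounding boxes derived from the original GIS polygons. As a rough illustration of what such a derivation can look like, the minimal sketch below computes an axis-aligned bounding box from a polygon's vertex list. The function name `polygon_to_bbox`, the `(x_min, y_min, x_max, y_max)` output convention, and the use of plain coordinate tuples are assumptions made for this example; the paper's actual conversion procedure may differ.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def polygon_to_bbox(vertices: Sequence[Point]) -> BBox:
    """Return the axis-aligned bounding box of a polygon.

    Hypothetical helper: the output order (x_min, y_min, x_max, y_max)
    is an assumed convention, not one stated in the source.
    """
    if not vertices:
        raise ValueError("polygon must contain at least one vertex")
    # Split the vertex list into separate x and y coordinate sequences.
    xs, ys = zip(*vertices)
    return (min(xs), min(ys), max(xs), max(ys))


# Example: an irregular four-vertex footprint in image coordinates.
footprint = [(2.0, 1.0), (5.0, 3.0), (4.0, 6.0), (1.0, 4.0)]
bbox = polygon_to_bbox(footprint)
```

An axis-aligned box discards the polygon's orientation and exact outline, which is presumably why the dataset also provides segmentation masks alongside the boxes.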