MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

Oskar Kristoffersen, Alba R. Sánchez, Morten R. Hannemose, Anders B. Dahl, Dim P. Papadopoulos

TL;DR

The paper introduces MMLandmarks, a large-scale, four-modality, instance-level benchmark for geo-spatial understanding that aligns ground-view imagery, high-resolution aerial imagery, text, and GPS coordinates across 18,557 US landmarks. It presents a simple CLIP-inspired baseline that learns a shared embedding for all modalities using frozen image encoders, a text encoder, and a location encoder, trained with an extended InfoNCE objective, and demonstrates strong performance across cross-view retrieval and geolocalization tasks. The dataset design emphasizes one-to-one modality correspondence, diverse and time-varied imagery from NAIP, and permissive licensing to enable broad research and sharing. Ablation studies show the importance of outdoor-ground filtering and sampling strategies, highlighting the dataset’s value for developing genuinely unified multimodal geo-spatial models with practical impact for localization, navigation, and geographic reasoning.
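
The summary above does not spell out how the InfoNCE objective is extended to four modalities. As a minimal sketch, and assuming the extension simply averages a symmetric InfoNCE term over every pair of modalities, the loss could look like the following. The function names, the temperature value, and the pairwise averaging are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """One-directional InfoNCE: row i of `anchors` should match row i of `targets`."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def multimodal_info_nce(embeddings, temperature=0.07):
    """Average symmetric InfoNCE over every pair of modalities.

    `embeddings` maps a modality name to a list of vectors, one per landmark,
    with the same landmark order in every modality.
    ASSUMPTION: equal weighting of all modality pairs.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    total = 0.0
    for m1, m2 in pairs:
        forward = info_nce(embeddings[m1], embeddings[m2], temperature)
        backward = info_nce(embeddings[m2], embeddings[m1], temperature)
        total += 0.5 * (forward + backward)
    return total / len(pairs)
```

With four modalities this averages six pairwise terms. The loss is near zero when each landmark's embeddings coincide across modalities, and it grows when any modality is misaligned with the others.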

Abstract

Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.

Paper Structure

This paper contains 18 sections, 1 equation, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: MMLandmarks. We present four distinct data modalities: ground-view images, aerial imagery, GPS coordinates, and textual descriptions, collected from $18{,}557$ unique landmarks in the United States. Data sources are included alongside each modality.
  • Figure 2: Pipeline for collecting the landmarks with the required criteria. Tags from OpenStreetMap are used to collect Wiki-identifiers, ensuring that landmarks have a Wikipedia and Wikimedia Commons page. If both are available, we check that the longest edge of the landmark's bounding box is smaller than 400 meters to keep an even size distribution across the dataset. Every resulting landmark has a Wikimedia Commons page (ground), a Wikipedia page (text), a box size and center (coordinates), and associated aerial imagery (satellite).
  • Figure 3: Text-to-GPS (top $1000$), Text-to-Ground and Text-to-Satellite retrieval from the index set with the baseline model. The model accurately locates regions and images that are semantically relevant to the prompt, illustrating strong feature alignment across modalities.
  • Figure 4: Histogram distribution of the number of images per landmark. A large proportion of landmarks have between 1 and 10 ground images, with a long-tailed distribution. The number of satellite images per landmark follows a bell curve centred at 10 images, with some landmarks having up to 20 aerial images.
  • Figure 5: Visual and geographical illustrations of the landmark distribution across MMLandmarks.
  • ...and 10 more figures
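
The landmark selection criteria in the Figure 2 caption (both wiki pages present, longest bounding-box edge under 400 meters) can be sketched roughly as follows. The record layout, the dictionary keys `wikipedia`, `wikimedia_commons`, and `bbox`, and the use of the haversine distance are assumptions made for illustration; the caption does not say how the edge length is measured.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters
MAX_EDGE_M = 400.0            # threshold stated in the Figure 2 caption


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def longest_edge_m(south, west, north, east):
    """Longest edge of a lat/lon bounding box, measured in meters."""
    height = haversine_m(south, west, north, west)
    mid_lat = (south + north) / 2  # measure the width at the box's middle latitude
    width = haversine_m(mid_lat, west, mid_lat, east)
    return max(height, width)


def keep_landmark(record):
    """Return True if a (hypothetical) landmark record passes both criteria."""
    has_text = bool(record.get("wikipedia"))            # Wikipedia page (text)
    has_ground = bool(record.get("wikimedia_commons"))  # Commons page (ground images)
    if not (has_text and has_ground):
        return False
    return longest_edge_m(*record["bbox"]) < MAX_EDGE_M
```

A record whose box spans 0.001 degrees of latitude (about 111 meters) would pass, while one spanning 0.01 degrees (about 1.1 kilometers) would be rejected.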