MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
Oskar Kristoffersen, Alba R. Sánchez, Morten R. Hannemose, Anders B. Dahl, Dim P. Papadopoulos
TL;DR
The paper introduces MMLandmarks, a large-scale, four-modality, instance-level benchmark for geo-spatial understanding that aligns ground-view imagery, high-resolution aerial imagery, text, and GPS coordinates across 18,557 US landmarks. It presents a simple CLIP-inspired baseline that learns a shared embedding for all modalities using frozen image encoders, a text encoder, and a location encoder, trained with an extended InfoNCE objective, and demonstrates strong performance across cross-view retrieval and geolocalization tasks. The dataset design emphasizes one-to-one modality correspondence, diverse and time-varied imagery from NAIP, and permissive licensing to enable broad research and sharing. Ablation studies show the importance of outdoor-ground filtering and sampling strategies, highlighting the dataset’s value for developing genuinely unified multimodal geo-spatial models with practical impact for localization, navigation, and geographic reasoning.
Abstract
Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
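To make the idea of a shared embedding trained with a contrastive objective concrete, the following is a minimal sketch of one way an InfoNCE loss could be extended from two modalities to four. It is not the formulation used for the baseline, which is not specified here. The sketch assumes that each modality has already been encoded into fixed-length vectors, that row i of every modality refers to the same landmark, and that the extension simply averages a symmetric pairwise InfoNCE over all modality pairs. The function names, the modality keys, and the temperature of 0.07 are hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Row i of `anchors` and row i of `targets` are assumed to describe the
    same landmark (the positive pair); all other rows act as negatives.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    # logits[i][j] = cosine similarity of anchor i and target j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


def multimodal_info_nce(batch, temperature=0.07):
    """Average the pairwise loss over every pair of modalities.

    ASSUMPTION: averaging over all pairs is one plausible extension,
    chosen here for illustration only.
    """
    pairs = list(combinations(sorted(batch), 2))
    return sum(info_nce(batch[p], batch[q], temperature) for p, q in pairs) / len(pairs)


# Toy batch of two landmarks with 2-D embeddings (hypothetical values).
aligned = {
    "ground": [[1.0, 0.0], [0.0, 1.0]],
    "aerial": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "gps": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch, but the aerial rows are swapped so they no longer match.
misaligned = dict(aligned, aerial=[[0.0, 1.0], [1.0, 0.0]])

loss_aligned = multimodal_info_nce(aligned)
loss_misaligned = multimodal_info_nce(misaligned)
print(loss_aligned < loss_misaligned)
```

With four modalities the sum runs over six pairs, so every modality is pulled toward every other one rather than toward a single anchor modality. The loss relies on the one-to-one correspondence across modalities, since row i must be a true positive in every pair.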
