MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars

Shuoyuan Wang, Yiran Wang, Hongxin Wei

TL;DR

MarsRetrieval introduces a retrieval-centric benchmark to evaluate vision-language models for Martian geospatial discovery across multiple spatial scales. It defines three tasks—Paired Image–Text Retrieval, Landform Retrieval, and Global Geo-Localization—operating under a unified protocol that maps images and texts into a shared embedding space using cosine similarity, enabling zero-shot evaluation. Across encoder-based VLMs and MLLMs, the benchmark reveals substantial challenges and demonstrates that domain-specific fine-tuning is essential for robust, generalizable Martian geospatial discovery; caption refinement and prompt ensembles further improve performance. The dataset and protocol provide a reusable framework for Mars and can be adapted to other planetary contexts, facilitating progress toward scalable, language-guided planetary exploration.

Abstract

Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at https://github.com/ml-stat-Sustech/MarsRetrieval
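
The retrieval protocol summarized above (images and texts mapped into a shared embedding space and ranked by cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy random embeddings standing in for real encoder outputs, and the choice of Recall@k as the metric are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_similarity_matrix(query_emb, gallery_emb):
    """Return a (num_queries, num_gallery) matrix of cosine similarities."""
    return l2_normalize(query_emb) @ l2_normalize(gallery_emb).T

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth gallery item is in the top k."""
    topk = np.argsort(-sim, axis=1)[:, :k]          # best-scoring items first
    hits = (topk == gt_index[:, None]).any(axis=1)  # one boolean per query
    return float(hits.mean())

# Toy stand-ins for encoder outputs (hypothetical; a real run would use a VLM).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 8))                     # gallery: 5 images
text_emb = image_emb + 0.01 * rng.normal(size=(5, 8))   # queries: paired captions
gt_index = np.arange(5)                                 # caption i pairs with image i

sim = cosine_similarity_matrix(text_emb, image_emb)
print("Recall@1:", recall_at_k(sim, gt_index, k=1))
```

Because nothing in the sketch is trained, the same scoring code applies unchanged to any encoder that yields fixed-size embeddings, which is what makes a zero-shot comparison across model families possible.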

Paper Structure (78 sections, 13 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: An overview of MarsRetrieval categories with examples. The benchmark consists of three challenging tasks for cross-modal retrieval in Martian exploration. See Table \ref{tab_marsretrieval_overview} for details about the capabilities measured and other information.
  • Figure 2: Planetary scale coverage in task 1 (Paired Image–Text Retrieval). We include Martian paired samples from orbital terrain to rover instrumentation.
  • Figure 3: Dataset distribution across the 7 major genetic classes of Martian landforms in task 2 (Landform Retrieval). The detailed distribution of subclasses is shown in Figure \ref{fig_task2_detailed_distribution}.
  • Figure 3: Ablation on model scale in Paired Image-Text Retrieval (task 1). Best results within each family are in bold.
  • Figure 4: Global ground-truth distribution from scientific catalogs (morgan2022global, souness2012inventory, roback2021controls, mills2024global, liu2020mapping) for Task 3 (Global Geo-Localization).
  • ...and 7 more figures