OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang, Mingshuo Chen, Xuming He, Yueying Li, YiFan Zhang, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang, Wenlong Zhang, Lei Bai

TL;DR

OmniEarth-Bench addresses the need for holistic evaluation of multimodal Earth-system models by introducing a benchmark that spans all six spheres and cross-sphere interactions. It presents a four-level evaluation framework with 109 Level-4 tasks and nearly 30,000 expert-annotated samples drawn from 33 data sources, created via a rigorous expert-led pipeline. Evaluations across nine state-of-the-art MLLMs reveal substantial gaps in Earth-system cognition, with none surpassing 35% accuracy, underscoring the need for domain-specific knowledge and advanced reasoning. The benchmark, dataset, and evaluation code aim to catalyze development of geoscience-focused MLLMs and enable system-level environmental monitoring and climate science applications.

Abstract

Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere or the atmosphere and to at most 16 tasks. These limitations are threefold: \textit{narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility}. We therefore introduce \textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically spans all six spheres (atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere) as well as cross-sphere interactions. Built with a scalable, modular-topology data inference framework, native multi-observation sources, and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs show that even the most advanced models struggle with our benchmark: none reaches 35\% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code are released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
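
To make the four-level hierarchy (Sphere, Scenario, Ability, Task) concrete, here is a minimal sketch of how per-sample correctness could be rolled up into accuracy at any level. It is an illustration only: the record layout, the scenario, ability, and task names, and the pooled (micro-averaged) accuracy are all assumptions, not the schema or scoring protocol of the released evaluation code.

```python
from collections import defaultdict

# Hypothetical records: (sphere, scenario, ability, task, is_correct).
# Names and values are illustrative assumptions, not the benchmark's schema.
records = [
    ("atmosphere", "scenario_a", "perception", "task_1", True),
    ("atmosphere", "scenario_a", "perception", "task_1", False),
    ("atmosphere", "scenario_a", "reasoning", "task_2", True),
    ("cryosphere", "scenario_b", "perception", "task_3", False),
]


def accuracy_by_level(records, depth):
    """Return pooled accuracy per hierarchy prefix (1 = Sphere ... 4 = Task)."""
    if not 1 <= depth <= 4:
        raise ValueError("depth must be between 1 and 4")
    totals = defaultdict(lambda: [0, 0])  # prefix -> [n_correct, n_total]
    for *path, is_correct in records:
        key = tuple(path[:depth])
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


sphere_acc = accuracy_by_level(records, 1)  # keyed by (sphere,)
task_acc = accuracy_by_level(records, 4)    # keyed by the full four-level path
```

Keying each result by its hierarchy prefix keeps tasks that share a name under different spheres separate. A macro average over tasks would need an extra step that averages the depth-4 values within each prefix.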

Paper Structure

This paper contains 47 sections, 18 figures, 15 tables.

Figures (18)

  • Figure 1: Overview of OmniEarth-Bench. Our benchmark spans six Earth science spheres and cross-sphere, encompassing 109 expert-curated tasks derived from 33 data sources. This involved the efforts of 20 experts and 45 crowd-sourcing annotators contributing to the annotations.
  • Figure 2: Comparison of Different Benchmarks. Our OmniEarth-Bench shows the broadest coverage, with dedicated cross-sphere dimensions.
  • Figure 3: Pipeline of OmniEarth-Bench. Our pipeline comprises 4 stages: Source Screening, Task Construction, Dataset Construction, and Quality Control, all led by experts. The first two stages are exclusively conducted by experts, while crowd-sourcing annotators assist in the latter two stages.
  • Figure 4: Examples of OmniEarth-Bench. OmniEarth-Bench comprises 109 unique L4 tasks, each with distinct questions, answers, and images.
  • Figure 5: Details of Dimensions in Cross-sphere. Cross-sphere has 3 L2 dimensions, 2 L3 dimensions, and 7 L4 dimensions.
  • ...and 13 more figures