MSEarth: A Multimodal Scientific Dataset and Benchmark for Phenomena Uncovering in Earth Science

Xiangyu Zhao, Wanghan Xu, Bo Liu, Yuhao Zhou, Fenghua Ling, Ben Fei, Xiaoyu Yue, Lei Bai, Wenlong Zhang, Xiao-Ming Wu

TL;DR

MSEarth addresses the gap in graduate-level Earth-science reasoning by building a scalable multimodal benchmark from 64,560 open-access papers, yielding 289,891 figures with refined captions grounded in paper content. It introduces two components, MSEarthCap and MSEarthQA, and a rigorous multi-agent and expert validation pipeline to create high-quality VQA tasks (captioning, MCQ, open-ended QA). The framework provides both a robust test bed and a large training corpus, showing that current MLLMs struggle with specialized Earth-science reasoning and that domain-specific data and architectures are crucial for closing the reasoning gap. By openly releasing data and methods, MSEarth aims to accelerate development of domain-aware multimodal models and can be extended to other scientific disciplines.

Abstract

The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application to Earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 289K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple-choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field.

Paper Structure

This paper contains 37 sections, 19 figures, 14 tables.

Figures (19)

  • Figure 1: Illustrative examples of the diverse types of scientific figures in MSEarth, sourced from open-access articles available online.
  • Figure 1: Main statistics in MSEarth-Bench.
  • Figure 2: Illustration of VQA generation methodologies: (a) VQA relying exclusively on figure captions, and (b) VQA utilizing refined captions that integrate figure captions with content from academic papers. Highlighted areas denote questions and answers supported by evidence.
  • Figure 3: Data curation process for MSEarth. The two parts on the left represent data preprocessing, while the two parts on the right encompass the automated generation of VQA and expert-AI collaborative filtering.
  • Figure 4: Overview of our multi-agent, voting-based approach to automating the validation of generated questions.
  • ...and 14 more figures