AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy

Jinghang Shi; Xiaoyu Tang; Yang Huang; Yuyang Li; Xiao Kong; Yanxia Zhang; Caizhan Yue

TL;DR

AstroMMBench addresses the need for domain-specific evaluation of multimodal large language models (MLLMs) in astronomy. It is a 621-question multiple-choice benchmark spanning six astrophysical subfields, built by automated question generation from arXiv image-text pairs and validated by 15 domain experts. The study evaluates 25 MLLMs (22 open-source and 3 closed-source) with the VLMEvalKit framework and finds that Ovis2-34B achieves the highest overall accuracy, 70.53%, outperforming even strong closed-source models. General multimodal capability correlates positively, though not uniformly, with astronomy-specific performance (Pearson r ≈ 0.82), which points to both transferability and the distinct demands of astronomical data. Accuracy varies by subfield: cosmology and high-energy astrophysics are harder for the models than instrumentation and solar physics. These results highlight the need for domain-focused benchmarks to guide model development and safe deployment. AstroMMBench is presented as a dynamic resource to catalyze AI-astronomy integration and continued improvement of MLLM capabilities on scientific tasks.

Abstract

Astronomical image interpretation presents a significant challenge for applying multimodal large language models (MLLMs) to specialized scientific tasks. Existing benchmarks focus on general multimodal capabilities but fail to capture the complexity of astronomical data. To bridge this gap, we introduce AstroMMBench, the first comprehensive benchmark designed to evaluate MLLMs in astronomical image understanding. AstroMMBench comprises 621 multiple-choice questions across six astrophysical subfields, curated and reviewed by 15 domain experts for quality and relevance. We conducted an extensive evaluation of 25 diverse MLLMs, including 22 open-source and 3 closed-source models, using AstroMMBench. The results show that Ovis2-34B achieved the highest overall accuracy (70.5%), demonstrating leading capabilities even compared to strong closed-source models. Performance varied across the six astrophysical subfields: domains such as cosmology and high-energy astrophysics proved particularly challenging, while models performed relatively better in others, such as instrumentation and solar astrophysics. These findings underscore the vital role of domain-specific benchmarks like AstroMMBench in critically evaluating MLLM performance and guiding their targeted development for scientific applications. AstroMMBench provides a foundational resource and a dynamic tool to catalyze advancements at the intersection of AI and astronomy.
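
The headline quantities quoted here (overall accuracy, per-subfield accuracy on multiple-choice questions, and a Pearson correlation between general and astronomy-specific scores) are simple to compute. The following is a minimal sketch for illustration only, not the authors' VLMEvalKit pipeline: the record layout `(subfield, predicted_option, correct_option)`, the function names, and the toy answers are all assumptions and do not come from the benchmark.

```python
from collections import defaultdict
from math import sqrt


def accuracy_by_subfield(records):
    """Return (per-subfield accuracy dict, overall accuracy).

    `records` is an iterable of (subfield, predicted_option, correct_option)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subfield, predicted, correct in records:
        totals[subfield] += 1
        if predicted == correct:
            hits[subfield] += 1
    if not totals:
        raise ValueError("no records given")
    per_subfield = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_subfield, overall


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy answers, invented for the example (not benchmark data).
toy_records = [
    ("cosmology", "A", "B"),
    ("cosmology", "C", "C"),
    ("solar", "D", "D"),
    ("solar", "A", "A"),
]
per_subfield, overall = accuracy_by_subfield(toy_records)
```

With one accuracy per model from a general multimodal benchmark and one from the astronomy benchmark, `pearson_r(general_scores, astro_scores)` gives the kind of correlation figure reported in the TL;DR.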
