AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy

Jinghang Shi, Xiaoyu Tang, Yang Huang, Yuyang Li, Xiao Kong, Yanxia Zhang, Caizhan Yue

TL;DR

AstroMMBench addresses the need for domain-specific evaluation of multimodal LLMs in astronomy by introducing a 621-question benchmark spanning six astrophysical subfields. The questions were generated automatically from arXiv image-text pairs and validated by 15 domain experts. The study evaluates 25 MLLMs (22 open-source and 3 closed-source) using the VLMEvalKit framework and finds that Ovis2-34B achieves the highest overall accuracy at 70.53%, performing strongly even against top closed-source models. General multimodal capability correlates positively with astronomy-specific performance (Pearson r ≈ 0.82), but the relationship does not hold for every model, which points to both the transferability of general skills and the unique demands of astronomical data. Results vary by subfield, with cosmology and high-energy astrophysics posing greater challenges than instrumentation and solar physics, and they highlight the need for domain-focused benchmarks to guide model development and safe deployment. AstroMMBench is presented as a dynamic resource to catalyze AI-astronomy integration and ongoing improvements in MLLM capabilities for scientific tasks.

Abstract

Astronomical image interpretation presents a significant challenge for applying multimodal large language models (MLLMs) to specialized scientific tasks. Existing benchmarks focus on general multimodal capabilities but fail to capture the complexity of astronomical data. To bridge this gap, we introduce AstroMMBench, the first comprehensive benchmark designed to evaluate MLLMs in astronomical image understanding. AstroMMBench comprises 621 multiple-choice questions across six astrophysical subfields, curated and reviewed by 15 domain experts for quality and relevance. We conducted an extensive evaluation of 25 diverse MLLMs, including 22 open-source and 3 closed-source models, using AstroMMBench. The results show that Ovis2-34B achieved the highest overall accuracy (70.5%), demonstrating leading capabilities even compared to strong closed-source models. Performance varied across the six astrophysical subfields: cosmology and high-energy astrophysics proved particularly challenging, while models performed relatively better in others, such as instrumentation and solar astrophysics. These findings underscore the vital role of domain-specific benchmarks like AstroMMBench in critically evaluating MLLM performance and guiding targeted model development for scientific applications. AstroMMBench provides a foundational resource and a dynamic tool to catalyze advancements at the intersection of AI and astronomy.

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Distribution of questions across astronomy subfields in AstroMMBench.
  • Figure 2: Examples of randomly selected questions in AstroMMBench.
  • Figure 3: Automated pipeline for multiple-choice question generation and review. The pipeline is divided into two stages. (a) The initial stage involves the autogeneration of multiple-choice questions. Llama-3.3-70B-Instruct refines textual descriptions associated with astronomical images, while InternVL2.5-78B generates corresponding questions. (b) The second stage is the review process, where the generated questions undergo filtering by large language models (LLMs) and expert evaluation to ensure the quality, correctness, and relevance of both the questions and answers before their inclusion in the final benchmark.
  • Figure 4: Distribution of question difficulty in AstroMMBench, based on the number of evaluated models that correctly answered each question. The x-axis indicates the "Number of Models Correctly Answering" a question (0-25), and the y-axis shows the count of questions at each correctness level, broken down by subfield.
  • Figure 5: Relationship between general multimodal performance (OpenCompass score) and specialized astronomical image interpretation performance (AstroMMBench overall accuracy) for 22 MLLMs. Point size represents model scale (parameter count).
  • ...and 1 more figure
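
The two quantities behind Figures 4 and 5 can be illustrated with a short sketch. It is not the authors' code. It assumes per-model results are available as one list of booleans per model (one entry per question) and that each model has one general score and one benchmark accuracy. The function names `correct_counts` and `pearson_r`, and all data in the usage lines, are hypothetical.

```python
from math import sqrt


def correct_counts(results):
    """Count, for each question, how many models answered it correctly.

    `results` maps a model name to a list of booleans, one per question
    (an assumed layout, not the benchmark's actual file format).
    """
    per_model = list(results.values())
    # zip(*...) regroups the flags question by question.
    return [sum(flags) for flags in zip(*per_model)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration only (not results from the paper).
toy_results = {
    "model_a": [True, False, True],
    "model_b": [True, True, False],
}
difficulty = correct_counts(toy_results)  # one count per question

toy_general = [55.0, 60.0, 70.0]  # hypothetical general-benchmark scores
toy_astro = [48.0, 57.0, 66.0]    # hypothetical astronomy accuracies
r = pearson_r(toy_general, toy_astro)
```

A histogram of `difficulty` grouped by subfield would give a plot in the style of Figure 4, and `r` is the kind of statistic the TL;DR reports as r ≈ 0.82.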