
AstroMLab 1: Who Wins Astronomy Jeopardy!?

Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi

TL;DR

The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future, and the development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy.

Abstract

We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, which is crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observe a universal reduction, every 3 to 12 months, in the cost required to achieve a similar score on this particular astronomy benchmark. Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more with exoplanet-related fields, stellar astrophysics, and instrumentation-related questions. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.

Paper Structure (32 sections, 2 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Benchmarking scores of proprietary large language models for MCQ answering in astronomical research. The left panel groups models by series, with darker shades indicating more recent or larger models within each series. We tested GPT-3.5, GPT-4, GPT-4o, Claude-2.0, Claude-3.0 (Haiku, Sonnet, Opus), Claude-3.5-Sonnet, Gemini-1.0-Pro, Gemini-1.5 (Flash, Pro), GLM-3, GLM-4 (Flash, Air, AirX, 0520), Ernie-3.5, Ernie-4.0, Deepseek-v2, Step-1, Step-2, Doubao (Lite, Pro), ABAB-5.5, ABAB-6.5, Yi (Medium, Large), and Moonshot-v1. Claude-3.5-Sonnet performs best with 85.0% accuracy, outperforming the closest non-Anthropic competitor, GPT-4o, by 4.6 percentage points. Among other leading models, GLM-4-0520 achieves 75.1%, a gap of 9.9 percentage points from the top performer. Interestingly, while many cutting-edge models perform similarly in general benchmarks, they show significant variability in this niche astronomical research question-answering task. The performance gap can be as large as 14.9 percentage points (between Claude-3.5-Sonnet and Doubao-Pro), demonstrating the need for domain-specific benchmarks. The right panel shows the same scores sorted by overall performance, regardless of model series, highlighting the wide range of capabilities across different models in this specific task. The error bars in the right panel display the Wilson Score Interval as uncertainties ($\pm 0.6-0.8$ percentage points) for three representative models, reflecting the statistical variation due to the finite set of 4,425 questions.
  • Figure 2: Cost and performance trade-off in astronomical Q&A. The dual x-axes show the cost per 0.1 million tokens (typical for agent deployment on one astronomical source; see text for details) and the cost to process an ArXiv astro-ph worth of tokens ($\sim$3B tokens). We use the average of input and output token costs based on June 2024 prices. To avoid crowding, only representative models are shown; full performances are in Tables \ref{table1} and \ref{table2}. Models with an outer circle are open-weights models run on low-cost APIs, leveraging recent specialized GPU developments for transformers. Left arrows indicate cheaper open-weights models below the plot's lower bound. Dotted lines of the same color connect models of the same series. Generally, within a series, a 10-fold cost increase corresponds to a 3.5-point accuracy improvement. Dashed guidelines show equivalent performance accounting for cost trade-offs, with the bold dashed line showing GPT-4o's value. Claude-3.5-Sonnet outperforms other models. LLaMA-3-70B is the only model in the same tier, albeit with lower performance. Second-tier models include Gemma-2-9B, Gemma-2-27B, Qwen-2-72B, and Claude-3.0-Haiku, which perform similarly to GPT-4o and Claude-3.0-Opus when price is considered. However, for GPT-4-like performance ($>$80% accuracy), only Claude-3.5-Sonnet, Claude-3.0-Opus, and LLaMA-3-70B qualify. A representative Wilson Score Interval, calculated for a 75% accuracy rate over the 4,425 questions, is displayed in the bottom right corner as an uncertainty reference.
  • Figure 3: The Cost Efficiency Improvement Rate for Proprietary Models. This figure demonstrates the trade-off between astronomical MCQ answering accuracy and price for representative proprietary model families that have released multiple series: OpenAI/GPT, Anthropic/Claude, Google/Gemini, and Zhipu/GLM. The size of each point represents the recency of the model's release, with larger points indicating more recent releases. The dashed lines, similar to Fig. \ref{fig2}, show improvements in cost-efficiency, where moving up one line represents a 3.5-point increase in score for the same cost, or equivalently, a 10x improvement in value for the same performance (see text for details). For all these models, we observe rapid improvements in performance and cost-efficiency over time. Gemini achieved the equivalent of about a 100x improvement in cost-efficiency in three months (Gemini-1.0 to Gemini-1.5). Claude achieved about a 10x improvement in cost-efficiency over about 3 months (Claude-3.0 to Claude-3.5). The GPT series improved by 30x in cost-efficiency over about 14 months (GPT-3.5 and GPT-4 to GPT-4o). The GLM series shows improvements of about 10-100x in cost-efficiency within 6 months (GLM-3 to GLM-4).
  • Figure 4: Performance of Selected Proprietary Large Language Models on Astronomy Multiple Choice Questions by Subfield Topic. The results are shown in the radar chart, with concentric circles representing different performance levels. The six categories for topics follow the subcategorization of the ArXiv astro-ph classification: 'Solar and Stellar Astrophysics', 'Earth and Planetary Astrophysics', 'Astrophysics of Galaxies', 'Cosmology and Nongalactic Astrophysics', 'High Energy Astrophysics', and 'Instrumentation and Methods for Astrophysics'. The left panel shows the results from Claude-3.5-Sonnet, GPT-4o, Claude-3.0-Opus, and Gemini-1.5-Pro, and the right panel features Yi-Large, Step-2, and GLM-4-0520. Despite these models performing on par with each other in general benchmarks, the latter group seems to perform worse on specialized and somewhat niche astronomical topics. There is a tentative trend of more limitations in recent astronomical research topics such as 'Solar and Stellar Astrophysics', 'Earth and Planetary Astrophysics', and 'Instrumentation and Methods for Astrophysics'. This suggests that part of the degradation might correlate with the training sets adopted in these different models, affecting their performance on more specialized topics. The full results of all other models are listed in Table \ref{table3}.
  • Figure 5: Performance of Selected Proprietary Large Language Models by Tested Ability on Astronomy Multiple Choice Questions. This plot is similar to Fig. \ref{fig4}, except here we categorize the questions based on their tested ability rather than subfield topics. The five classes are 'Understanding Fundamental Concepts', 'Technical and Observational Techniques', 'Analytical and Reasoning Skills', 'Historical and Theoretical Knowledge', and 'Current Research and Advanced Topics'. Despite these models performing comparably in general benchmarks, as shown in the right panel, the non-English-focused models exhibit more significant degradation in 'Historical and Theoretical Knowledge', and occasionally in 'Current Research and Advanced Topics', compared to the English-focused models shown in the left panel. This further suggests that the limitations observed in Fig. \ref{fig4} may stem from differences in training data among these models, affecting their performance on more specialized astronomical topics. The complete results for all other models are presented in Table \ref{table5}.
  • ...and 7 more figures
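
Two quantitative conventions recur in the captions above: the Wilson Score Interval used for the error bars (Figures 1 and 2), and the guideline that one step between dashed lines equals a 3.5-point score gain at fixed cost, or a 10x cost reduction at fixed score (Figures 2 and 3). The following is a minimal Python sketch of both, not the authors' code. The function names, the default `z = 1.0` (roughly a 68% interval, chosen here only because it gives a half-width in the quoted 0.6 to 0.8 percentage-point range), and the example inputs are assumptions for illustration.

```python
import math


def wilson_interval(successes, n, z=1.0):
    """Return (lower, upper) bounds of the Wilson score interval.

    successes: number of correctly answered questions
    n:         total number of questions
    z:         normal quantile (ASSUMPTION: z=1.0; the captions do not state it)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def cost_efficiency_gain(score_old, cost_old, score_new, cost_new,
                         points_per_decade=3.5):
    """Equivalent cost-efficiency factor between two models.

    Follows the captions' guideline: +3.5 accuracy points at equal cost
    counts the same as a 10x cheaper price at equal accuracy.
    Scores are in percentage points; costs share any common unit.
    """
    score_factor = 10 ** ((score_new - score_old) / points_per_decade)
    price_factor = cost_old / cost_new
    return score_factor * price_factor


if __name__ == "__main__":
    n_questions = 4425  # benchmark size quoted in the captions
    correct = round(0.75 * n_questions)  # the 75% reference case of Figure 2
    low, high = wilson_interval(correct, n_questions)
    print(f"75% accuracy: [{100 * low:.2f}%, {100 * high:.2f}%]")

    # HYPOTHETICAL scores and prices, for illustration only.
    gain = cost_efficiency_gain(70.0, 10.0, 73.5, 1.0)
    print(f"equivalent cost-efficiency gain: {gain:.0f}x")
```

With a larger `z` (for example 1.96 for a 95% interval) the interval widens accordingly, so the quantile should be matched to whatever convention the full text specifies.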