AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model

Tijmen de Haan, Yuan-Sen Ting, Tirthankar Ghosal, Tuan Dung Nguyen, Alberto Accomazzi, Emily Herron, Vanessa Lama, Rui Pan, Azton Wells, Nesar Ramachandra

TL;DR

The paper demonstrates that domain specialization, when scaled to a 70B-parameter model, can surpass leading generalist systems in astronomy. It achieves this through a three-stage pipeline (continued pre-training, supervised fine-tuning on a reasoning-oriented, domain-rich corpus, and model merging with DARE-TIES) and by enabling explicit reasoning with <think> traces. On the AstroMLab-1 benchmark, AstroSage-Llama-3.1-70B attains 86.2% accuracy, outperforming every other tested open-weight and proprietary model, including systems whose API costs are roughly two orders of magnitude higher. The work also emphasizes open availability under a permissive license to democratize access and accelerate astronomical research and education. Future directions include developing astronomy-specific reasoning benchmarks and integrating the model with domain tools to create more capable AI research assistants for astronomy.
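
To make the merging step concrete, here is a minimal sketch of the general DARE-TIES idea on flat lists of weights. It is not the authors' merge configuration: the function name `dare_ties_merge`, the `drop_prob` value, and the unweighted averaging are illustrative assumptions, and a real merge operates on full model tensors.

```python
import random

def dare_ties_merge(base, finetuned, drop_prob=0.5, seed=0):
    """Merge fine-tuned weight lists into one, relative to a shared base.

    Hypothetical helper for illustration; all names and defaults are assumptions.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_prob)

    # DARE step: for each model, drop every delta with probability drop_prob
    # and rescale the survivors so the expected delta is unchanged.
    deltas = []
    for tuned in finetuned:
        sparse = []
        for b, t in zip(base, tuned):
            kept = rng.random() >= drop_prob
            sparse.append((t - b) * scale if kept else 0.0)
        deltas.append(sparse)

    # TIES step: per parameter, elect the sign of the summed deltas and
    # average only the nonzero deltas that agree with that sign.
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        total = sum(column)
        if total > 0:
            agreeing = [v for v in column if v > 0]
        elif total < 0:
            agreeing = [v for v in column if v < 0]
        else:
            agreeing = []
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + update)
    return merged
```

With `drop_prob=0.0` the DARE step is a no-op, which isolates the sign election: deltas that conflict with the elected sign are excluded from the average rather than cancelling against it.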

Abstract

General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning and model merging. Beyond its 70-billion parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately, or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark -- comprising 4,425 questions from literature withheld during training -- AstroSage-70B achieves state-of-the-art performance. It surpasses all other tested open-weight and proprietary models, including leading systems like o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, Deepseek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.
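
Because the model can either answer directly or first emit a thought process, downstream code typically needs to separate the two. The sketch below assumes the trace is wrapped as `<think>...</think>`; the closing tag and the helper name `split_reasoning` are assumptions, not a documented output format.

```python
import re

# Assumed delimiter format; only the opening <think> tag is named in the summary.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response):
    """Return (reasoning, answer); reasoning is None when no trace is present."""
    match = THINK_BLOCK.search(response)
    if match is None:
        return None, response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first think block is treated as the final answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer
```

A response without a trace passes through unchanged apart from whitespace trimming, so the same code path handles both answering modes.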

Paper Structure

This paper contains 10 sections and 3 figures.

Figures (3)

  • Figure 1: Composition of the AstroSage-Llama-3.1-70B SFT training dataset. The combination of reasoning-focused datasets (41.8%) with domain-specific astronomy Q&A (30.8%) reflects our strategy to develop a model that combines analytical thinking with specialized knowledge. General instruction-following data (OpenHermes 2.5) helps maintain versatility while preserving domain expertise. File sizes represent uncompressed UTF-8 encoded text including system and user prompts.
  • Figure 2: Training dynamics for continued pre-training (CPT) and supervised fine-tuning (SFT). The top panel shows the training loss trajectory across 2.5 epochs of CPT followed by 0.6 epochs of SFT. Despite an early significant spike during CPT Epoch 1, the loss steadily decreases throughout training with minimal discontinuity at epoch boundaries, indicating effective learning without overfitting. The SFT phase exhibits a rapid initial decrease in loss, eventually reaching below 0.6. The bottom panel reveals the learning rate schedule, including the initial warm-up period, several manual adjustments during early CPT (visible as step changes), the planned cosine decay during the final partial CPT epoch, and a separate warm-up and decay cycle for the SFT phase. Early termination of both phases reflects the computational resource constraints.
  • Figure 3: Performance comparison on the AstroMLab-1 benchmark across 38 LLMs as of May 2025. The x-axis shows cost per 0.1M tokens (USD) on a logarithmic scale, while the y-axis shows accuracy percentage. AstroSage-Llama-3.1-70B achieves 86.2%, outperforming all other models including more expensive proprietary offerings like o3, Claude-3.7-Sonnet, and GPT-4.1. The diagonal dashed lines represent cost-efficiency trade-offs, where a tenfold increase in cost typically corresponds to a 3.5 percentage point improvement in accuracy. The vertical red arrows highlight the effect of domain specialization, showing that both AstroSage models jump approximately two cost-efficiency lines compared to their base models (Llama-3.1-8B and Llama-3.1-70B), representing a roughly $100\times$ improvement in cost-efficiency. The Wilson Score interval in the bottom right shows the typical uncertainty due to the finite number of questions. Figure adapted from Ting et al. (2024).
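
Two quantities in the Figure 3 caption can be reproduced with a few lines of arithmetic. The sketch below computes a Wilson score interval for an accuracy measured on a finite question set, and converts an accuracy gain into an equivalent cost factor using the caption's rule of 3.5 percentage points per tenfold cost increase. The 95% level (`z=1.96`), the use of the abstract's 4,425-question count, and both function names are assumptions made for illustration.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion `accuracy` observed on n questions."""
    denom = 1.0 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def equivalent_cost_factor(accuracy_gain_pp, pp_per_decade=3.5):
    """Cost multiplier that the trend line associates with a given accuracy gain."""
    return 10 ** (accuracy_gain_pp / pp_per_decade)

# Assumed inputs: 86.2% accuracy on 4,425 questions, 95% confidence level.
low, high = wilson_interval(0.862, 4425)
print(f"Wilson interval: {100 * low:.1f}% to {100 * high:.1f}%")

# Jumping two trend lines corresponds to 2 * 3.5 = 7 percentage points.
print(f"Equivalent cost factor: {equivalent_cost_factor(7.0):.0f}x")
```

Under these assumptions the interval spans about one percentage point on either side of the measured accuracy, and a two-line jump maps to the roughly $100\times$ cost-efficiency gain quoted in the caption.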