AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy

Rui Pan; Tuan Dung Nguyen; Hardik Arora; Alberto Accomazzi; Tirthankar Ghosal; Yuan-Sen Ting

TL;DR

The previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms its base model. This degradation can be partially mitigated by continual pretraining on high-quality data, such as summarized text from arXiv.

Abstract

Continual pretraining of large language models (LLMs) on domain-specific data has been proposed to enhance performance on downstream tasks. In astronomy, the absence of astronomy-focused benchmarks has so far hindered objective evaluation of these specialized LLMs. Leveraging a recent initiative to curate high-quality astronomical multiple-choice questions (MCQs), this study quantitatively assesses specialized LLMs in astronomy. We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms the base model. We demonstrate that this performance degradation can be partially mitigated by using high-quality data for continual pretraining, such as summarized text from arXiv. Although smaller models exhibit catastrophic forgetting, our results indicate that continual pretraining of the 70B model can yield significant improvements. However, the current supervised fine-tuning (SFT) dataset still constrains the performance of instruct models. Alongside this study, we introduce a new set of models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building on the previous AstroLLaMA series.

Paper Structure (14 sections, 1 figure, 1 table)

This paper contains 14 sections, 1 figure, 1 table.

Introduction
Existing Specialized LLMs for Astronomy
Extending AstroLLaMA: AstroLLaMA-2-70B and AstroLLaMA-3-8B
Benchmark MCQ Datasets
Inference Methodology
Full Instruct Benchmarking Method
Base Model Token Benchmarking Method
Instruct Model Token Benchmarking Method
Results
Discussion and Conclusion
Broader Impact
Example of Benchmark Questions
Full Instruct Benchmarking Method
Next Token Benchmarking Method

Figures (1)

Figure 1: Performance comparison of baseline LLaMA models and their specialized AstroLLaMA counterparts on astronomy MCQ benchmarks. Scores are shown as percentages for three prompting styles, each represented by a different symbol: full instruction-following, next-token prediction (instruct model), and next-token prediction (base model). Horizontal lines indicate the full instruct scores of the native LLaMA model in the series each AstroLLaMA is trained on. The existing AstroLLaMA-2-7B shows a notable drop in performance. The AstroLLaMA-3-8B introduced in this study mitigates that problem; however, we find that training on astro-ph data alone fails to improve the performance of the 8B models. This study introduces the first specialized LLM in astronomy at the 70B parameter level. AstroLLaMA-2-70B outperforms the native LLaMA-2-70B models, demonstrating that training on astro-ph data can improve knowledge recall at the 70B level. Notably, across all models, the instruct versions score lower than the base models evaluated by next-token prediction, especially under the full instruct benchmarking method. This suggests that the current lack of astronomy-focused Q&A data for SFT poses challenges in developing a useful assistant model from the specialized base model.
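The caption contrasts scores obtained by next-token prediction with scores obtained by full instruction-following. The Python sketch below illustrates one plausible way the two scoring styles could differ. It is not the paper's implementation: the four-option format, the function names, and the answer-parsing pattern are all assumptions made for illustration, and the actual prompts and parsing rules are described in the paper's benchmarking sections rather than here.

```python
import re
from typing import Dict, List, Optional

# Assumption: each MCQ has four options labelled A-D.
CHOICES = ("A", "B", "C", "D")


def next_token_answer(letter_logits: Dict[str, float]) -> str:
    """Next-token style: pick the option letter with the highest logit.

    `letter_logits` is a hypothetical mapping from option letter to the
    model's logit for that letter as the next token after the prompt.
    """
    return max(CHOICES, key=lambda c: letter_logits[c])


def full_instruct_answer(generated: str) -> Optional[str]:
    """Full-instruct style: parse the chosen letter from free-form output.

    The pattern is an illustrative assumption; it accepts phrasings such
    as "Answer: B" or "the answer is (C)" and returns None otherwise.
    """
    match = re.search(r"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([ABCD])\b", generated)
    return match.group(1) if match else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Score as a percentage; unparsed answers (None) count as wrong."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```

Under this sketch, a next-token evaluation can always produce an answer, whereas a full-instruct evaluation can lose points whenever the generated text does not state an answer in a recognizable form, which is one way an instruct score could fall below a next-token score for the same model.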
