
AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy

Rui Pan, Tuan Dung Nguyen, Hardik Arora, Alberto Accomazzi, Tirthankar Ghosal, Yuan-Sen Ting

TL;DR

The previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms its base model. This degradation can be partially mitigated by continual pretraining on high-quality data, such as summarized text from arXiv.

Abstract

Continual pretraining of large language models on domain-specific data has been proposed to enhance performance on downstream tasks. In astronomy, the absence of dedicated benchmarks has so far hindered objective evaluation of these specialized LLMs. Leveraging a recent initiative to curate high-quality astronomical multiple-choice questions (MCQs), this study aims to quantitatively assess specialized LLMs in astronomy. We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model. We demonstrate that this performance degradation can be partially mitigated by utilizing high-quality data for continual pretraining, such as summarized text from arXiv. Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining of the 70B model can yield significant improvements. However, the current supervised fine-tuning dataset still constrains the performance of instruct models. In conjunction with this study, we introduce a new set of models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building upon the previous AstroLLaMA series.

Paper Structure

This paper contains 14 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Performance comparison of baseline LLaMA models and their specialized AstroLLaMA counterparts on astronomy MCQ benchmarks. Scores are shown as percentages for three prompting styles, each with its own symbol: full instruction-following, next-token prediction (instruct model), and next-token prediction (base model). Horizontal lines indicate the full-instruct scores of the native LLaMA models in the series on which each AstroLLaMA is trained. The existing AstroLLaMA-2-7B shows a notable drop in performance. The AstroLLaMA-3-8B introduced in this study mitigates that problem; however, we find that training on astro-ph data alone fails to improve the performance of the 8B models. This study also introduces the first specialized astronomy LLM at the 70B-parameter scale. AstroLLaMA-2-70B outperforms the native LLaMA-2-70B models, demonstrating that training on astro-ph data can improve knowledge recall at the 70B level. Notably, across all models, the instruct versions, especially when evaluated with the full-instruct benchmarking method, score lower than the next-token prediction results. This suggests that the current lack of astronomy-focused Q&A for SFT poses challenges in developing a useful assistant model from the specialized base model.
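
The caption distinguishes full instruction-following from next-token prediction but does not spell out how an MCQ is scored in the latter case. A minimal sketch of one common approach, under stated assumptions: the model's next-token logits are restricted to the answer labels, the highest-scoring label is taken as the prediction, and accuracy is reported as a percentage. The labels A to D, the function names, and the toy logits below are all hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Dict, List, Sequence

# Assumed answer labels; the source does not state how options are labelled.
OPTIONS = ("A", "B", "C", "D")


def pick_by_next_token(token_logits: Dict[str, float],
                       options: Sequence[str] = OPTIONS) -> str:
    """Return the option label with the highest next-token logit.

    Tokens that are not answer labels are ignored, so a high-scoring
    non-answer token (e.g. "The") cannot be selected.
    """
    return max(options, key=lambda label: token_logits.get(label, float("-inf")))


def accuracy_percent(predictions: List[str], answers: List[str]) -> float:
    """Percentage of predictions that match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Toy logits for four hypothetical questions (illustrative values only).
toy_logits = [
    {"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5, "The": 5.0},
    {"A": 2.0, "B": 0.3, "C": 2.5, "D": 1.1},
    {"A": 0.0, "B": -1.0, "C": -2.0, "D": 4.2},
    {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6},
]
toy_answers = ["B", "C", "D", "C"]

toy_predictions = [pick_by_next_token(logits) for logits in toy_logits]
toy_score = accuracy_percent(toy_predictions, toy_answers)
```

Restricting the comparison to the answer labels means the score reflects which option the model prefers, independent of whether it can format a complete response. That is one plausible reason a next-token score could exceed a full-instruct score for the same model.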