AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets

Ernest Perkowski, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, Sandor Kruk, Tong Zhang, Charlie O'Neill, Maja Jablonska, Zechang Sun, Michael J. Smith, Huiling Liu, Kevin Schawinski, Kartheik Iyer, Ioana Ciucă for UniverseTBD

TL;DR

The paper investigates continual pre-training on a curated astronomy corpus to improve specialized QA with a compact 7B LLaMA-2 model, resulting in AstroLLaMA-Chat, a chat-enabled tool released openly on HuggingFace. While GPT-4 and larger LLaMA variants maintain strong general reasoning, AstroLLaMA-Chat demonstrates superior performance in highly specialized topics such as elemental-abundance dimensionality and parity-violation cosmology, aided by a domain-focused QA dataset generated from over 300,000 arXiv papers. The approach combines regex-based extraction of abstracts, introductions, and conclusions, synthetic QA pairs from GPT-4, and LMFlow-based fine-tuning (with Flash Attention, ZeRO, and long-context techniques), achieving substantial training efficiency (approximately a 5x speedup over prior methods). A 70B version is planned for release with the full paper, and ongoing benchmarking will be detailed later; the 7B model is available for community use to promote accessible, domain-specific conversational AI for astronomy.
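The regex-based extraction step mentioned above can be illustrated with a minimal Python sketch. The summary does not give the authors' actual patterns, so everything here is an assumption: the heading pattern, the `WANTED` tuple, and the function name `extract_sections` are hypothetical, and the sketch assumes plain-text input where each section heading sits on its own line, optionally numbered.

```python
import re

# Sections to keep (assumed from the summary: abstracts, introductions, conclusions).
WANTED = ("abstract", "introduction", "conclusion")

# Assumption: a heading is a short line of letters and spaces only,
# optionally preceded by a section number such as "1." or "2".
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?([A-Za-z][A-Za-z ]{2,40})[ \t]*$",
    re.MULTILINE,
)


def extract_sections(text):
    """Return a dict mapping each wanted section name to its body text."""
    headings = list(HEADING_RE.finditer(text))
    sections = {}
    for i, match in enumerate(headings):
        title = match.group(1).strip().lower()
        # "conclusions" and "conclusion" both map to the "conclusion" key.
        key = next((w for w in WANTED if title.startswith(w)), None)
        if key is None:
            continue
        # A section body runs until the next detected heading or end of text.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        sections.setdefault(key, text[match.end():end].strip())
    return sections


sample = """Abstract
We study stellar abundances.

1. Introduction
Elemental abundances trace chemical yields.

2. Methods
We fit a model.

3. Conclusions
The abundance space is low-dimensional.
"""

sections = extract_sections(sample)
for name, body in sections.items():
    print(f"{name}: {body}")
```

Real arXiv sources vary far more than this (LaTeX markup, unnumbered or renamed sections, short body lines that look like headings), so a production pipeline would likely need additional patterns and filtering.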

Abstract

We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora -- comprising abstracts, introductions, and conclusions -- we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.

Paper Structure

This paper contains 4 sections and 1 figure.

Figures (1)

  • Figure 1: Demonstration of AstroLLaMA-Chat's Capabilities. While general large language models like GPT-4 continue to exhibit robust reasoning and Q&A abilities, even in specialized domains such as astronomy, our study highlights the benefits of continual pre-training on a dedicated astronomy corpus from arXiv, enriched with the latest data. This approach gives AstroLLaMA-Chat an edge in two specific areas. The top example illustrates its performance in a highly specialized topic within astronomy. AstroLLaMA-Chat demonstrates a better understanding of the complexities involved in studying the dimensionality of elemental abundance in stars, reflecting the true chemical yield channels. It also outlines prevalent methods in this specialized area. In contrast, GPT-4 and the LLaMA-2-7b model, from which AstroLLaMA is derived, often provide responses that lack depth in understanding this field. The bottom panel illustrates AstroLLaMA-Chat's adeptness in addressing contemporary and dynamic research areas, notably the burgeoning field of parity violation studies in cosmology. While it captures some of the latest directions in the field (though with occasional detail inaccuracies), both GPT-4 and LLaMA-2 tend to diverge into broader implications and detection methods, failing to encapsulate the current focus of the field.