AstroLLaMA: Towards Specialized Foundation Models in Astronomy

Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciucă, Charlie O'Neill, Ze-Chang Sun, Maja Jabłońska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, Josh Peek, Kartheik Iyer, Tomasz Różański, Pranav Khetarpal, Sharaf Zaman, David Brodrick, Sergio J. Rodríguez Méndez, Thang Bui, Alyssa Goodman, Alberto Accomazzi, Jill Naiman, Jesse Cranney, Kevin Schawinski, UniverseTBD

TL;DR

AstroLLaMA is a domain-tuned 7B-parameter model derived from LLaMA-2 and trained on 300k astronomy abstracts to address the scarcity of astronomy-specific generative models. It is fine-tuned with LoRA on a 77M-token subset using 4-bit quantization on four GPUs, and reaches about 30% lower perplexity than LLaMA-2. The model demonstrates superior domain-aware text generation and discriminative embedding quality compared with GPT-4 and LLaMA-2, suggesting strong utility for tasks like automatic summarization and conversational agents in astronomy. The paper also discusses limitations such as knowledge gaps and hallucinations, and outlines plans for larger corpora, alignment strategies, and community-facing releases to accelerate astronomy research.

Abstract

Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model produces more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
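
To make the perplexity comparison concrete, here is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities, and how a relative reduction between two models would be expressed. The function names and the sample log-probabilities are hypothetical and chosen for illustration only; they are not taken from the paper or its evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_reduction(base_ppl, tuned_ppl):
    """Fraction by which tuned_ppl is lower than base_ppl (0.3 means 30% lower)."""
    return (base_ppl - tuned_ppl) / base_ppl


# Hypothetical per-token log-probabilities for the same text under two models.
# These numbers are made up for illustration; they are not results from the paper.
base_log_probs = [-2.0, -1.5, -2.5, -1.8]
tuned_log_probs = [-1.6, -1.2, -2.0, -1.5]

base_ppl = perplexity(base_log_probs)
tuned_ppl = perplexity(tuned_log_probs)
print(f"base: {base_ppl:.2f}  tuned: {tuned_ppl:.2f}  "
      f"reduction: {relative_reduction(base_ppl, tuned_ppl):.1%}")
```

A model that assigns probability 1/4 to every token has perplexity 4, so lower values indicate more confident next-token prediction on the evaluated text.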

Paper Structure (8 sections, 3 figures)

Figures (3)

  • Figure 1: Learning curve of AstroLLaMA during its fine-tuning on the arXiv astrophysics dataset. The figure tracks the evolution of perplexity, a measure of the model's next-token prediction performance. The light blue curve shows the training perplexity at each AdamW update step, while the black curve provides a smoothed average taken over 10-step intervals.
  • Figure 2: Completion of an abstract from the arXiv database (ID: 2306.15719) using three different models: GPT-4, LLaMA-2, and AstroLLaMA. Each model is prompted with the same short text snippet, which is highlighted in that model's box. GPT-4 tends to produce more generic statements, lacking domain-specific nuance. AstroLLaMA demonstrates the most robust completion, offering more relevant concepts and deeper insights specific to the field of astronomy, thus significantly outperforming LLaMA-2 and GPT-4.
  • Figure 3: Top: Distribution of pairwise cosine similarities among 10,000 randomly selected abstracts from our corpus, divided into 10 equal bins based on similarity levels from GPT-3. Bottom: Two representative examples illustrating divergent cosine similarity values when comparing AstroLLaMA and GPT-3 embeddings.
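
The embedding comparison described for Figure 3 can be sketched as follows: compute the cosine similarity for every pair of abstract embeddings, then group the pairs into bins ordered by the reference model's similarity. This is an illustrative sketch, not the authors' code. The function names are hypothetical, and interpreting "10 equal bins" as bins holding equal numbers of pairs is an assumption; equal-width bins would be another reasonable reading.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair, in combinations() order."""
    pairs = combinations(range(len(embeddings)), 2)
    return [cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs]


def equal_count_bins(reference, other, n_bins=10):
    """Group `other` values into n_bins bins ranked by `reference` values.

    Assumption: bins hold (nearly) equal numbers of pairs.
    """
    if len(reference) != len(other):
        raise ValueError("both similarity lists must cover the same pairs")
    order = sorted(range(len(reference)), key=lambda i: reference[i])
    n = len(order)
    return [
        [other[i] for i in order[b * n // n_bins:(b + 1) * n // n_bins]]
        for b in range(n_bins)
    ]


# Toy embeddings standing in for two models' outputs on the same four abstracts.
reference_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
other_embeddings = [[1.0, 0.2], [0.1, 1.0], [0.0, 1.0], [0.7, 0.3]]

reference_sims = pairwise_similarities(reference_embeddings)
other_sims = pairwise_similarities(other_embeddings)
bins = equal_count_bins(reference_sims, other_sims, n_bins=3)
```

Within each bin, the spread of the second model's similarities indicates how differently the two models separate pairs that the reference model considers about equally similar.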