Designing an Evaluation Framework for Large Language Models in Astronomy Research

John F. Wu; Alina Hyk; Kiera McCormick; Christine Ye; Simone Astarita; Elina Baral; Jo Ciuca; Jesse Cranney; Anjalie Field; Kartheik Iyer; Philipp Koehn; Jenn Kotler; Sandor Kruk; Michelle Ntampaka; Charles O'Neill; Joshua E. G. Peek; Sanjib Sharma; Mikaeel Yunus

TL;DR

The paper addresses the lack of standards for evaluating LLM-assisted astronomy research by proposing a dynamic, real-world evaluation framework. It designs a Retrieval-Augmented Generation (RAG) chatbot grounded in astro-ph arXiv papers, deployed via Slack to collect rich user interactions, feedback, and retrieval data. The contribution comprises an end-to-end experimental design, data schemas, and an IRB-approved plan to study how astronomers interact with and benefit from LLMs. This framework enables iterative improvements to LLM tools in astronomy and offers a path for future evaluation studies across subfields. The work highlights the importance of grounding LLM outputs with domain-specific literature and capturing user-centered metrics to inform practical deployment.

Abstract

Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy researchers interact with LLMs. We deploy a Slack chatbot that can answer queries from users via Retrieval-Augmented Generation (RAG); these responses are grounded in astronomy papers from arXiv. We record and anonymize user questions and chatbot answers, user upvotes and downvotes to LLM responses, user feedback to the LLM, and retrieved documents and similarity scores with the query. Our data collection method will enable future dynamic evaluations of LLM tools for astronomy.

Paper Structure (15 sections, 3 figures)

This paper contains 15 sections, 3 figures.

Introduction
Related Work
Generating robust answers with LLMs
Steering LLMs with prompting
Information retrieval
Retrieval Augmented Generation
Experimental Design
RAG with astronomy arXiv papers
Slack chatbot interactions
Compiling and Annotating User Data
Optional demographic information
Towards LLM Evaluation for Astronomy
Evaluating Research Topics
User evaluation studies
Conclusions

Figures (3)

Figure 1: A schematic showing the LLM backend for our system. First, a user query is encoded and used to retrieve the $k=5$ most similar papers based on their abstracts. We then concatenate the prompt string, the top-$k$ papers' abstracts, conclusions, and metadata, and the original user query, and send the combined text to the generator LLM, which outputs a response.
Figure 2: Example user interaction with the Slack chatbot.
Figure 3: Table schema for user annotation data. Each box shows a column in a table, and arrows denote relationships between columns in different tables. Red and blue text indicate data stored as floats and strings, respectively.
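
The retrieval-and-prompting flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Paper` fields, the function names, and the bag-of-words `embed` function are all hypothetical stand-ins (the actual system's encoder and generator are not specified here), and the generator LLM call is omitted entirely. Only the overall order of operations (encode the query, rank papers by abstract similarity, keep the top $k=5$, then concatenate prompt string, paper context, and query) follows the caption.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record layout; the real metadata fields are not given in the source.
    arxiv_id: str
    title: str
    abstract: str
    conclusions: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector, standing in for a real text encoder (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, papers: list[Paper], k: int = 5) -> list[tuple[Paper, float]]:
    """Rank papers by similarity between the query and each abstract; keep the top k."""
    q = embed(query)
    scored = [(p, cosine(q, embed(p.abstract))) for p in papers]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(system_prompt: str, query: str, hits: list[tuple[Paper, float]]) -> str:
    """Concatenate the prompt string, retrieved paper context, and the user query."""
    context = "\n\n".join(
        f"[{p.arxiv_id}] {p.title}\nAbstract: {p.abstract}\nConclusions: {p.conclusions}"
        for p, _ in hits
    )
    return f"{system_prompt}\n\n{context}\n\nQuestion: {query}"
```

In this sketch, the similarity scores returned by `retrieve` are kept alongside each paper, which mirrors the abstract's statement that retrieved documents and their similarity scores with the query are recorded; the string returned by `build_prompt` is what would be passed to the generator LLM.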
