Designing an Evaluation Framework for Large Language Models in Astronomy Research
John F. Wu, Alina Hyk, Kiera McCormick, Christine Ye, Simone Astarita, Elina Baral, Jo Ciuca, Jesse Cranney, Anjalie Field, Kartheik Iyer, Philipp Koehn, Jenn Kotler, Sandor Kruk, Michelle Ntampaka, Charles O'Neill, Joshua E. G. Peek, Sanjib Sharma, Mikaeel Yunus
TL;DR
The paper addresses the lack of standards for evaluating LLM-assisted astronomy research by proposing a dynamic, real-world evaluation framework. The authors design a Retrieval-Augmented Generation (RAG) chatbot grounded in astro-ph arXiv papers and deploy it via Slack to collect user questions, chatbot answers, votes, written feedback, and retrieval data. The contribution comprises an end-to-end experimental design, data schemas, and an IRB-approved plan to study how astronomers interact with and benefit from LLMs. This framework enables iterative improvements to LLM tools in astronomy and offers a path for future evaluation studies across subfields. The work highlights the importance of grounding LLM outputs in domain-specific literature and capturing user-centered metrics to inform practical deployment.
Abstract
Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy researchers interact with LLMs. We deploy a Slack chatbot that can answer queries from users via Retrieval-Augmented Generation (RAG); these responses are grounded in astronomy papers from arXiv. We record and anonymize user questions and chatbot answers, user upvotes and downvotes to LLM responses, user feedback to the LLM, and retrieved documents and similarity scores with the query. Our data collection method will enable future dynamic evaluations of LLM tools for astronomy.
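To make the recorded fields concrete, the following is a minimal Python sketch of what one logged interaction might look like. It is not the authors' schema or code: the names `InteractionRecord`, `RetrievedDocument`, `pseudonymize`, and `retrieve`, the salted-hash scheme, the choice of cosine similarity, and the toy two-dimensional embeddings are all assumptions made for illustration. Salted hashing is only a stand-in for the anonymization step, which the abstract does not specify.

```python
import hashlib
import math
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


def pseudonymize(user_id: str, salt: str) -> str:
    """Map a raw user ID to a salted SHA-256 digest (hypothetical scheme)."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:16]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class RetrievedDocument:
    doc_id: str        # placeholder identifier for a retrieved paper
    similarity: float  # similarity score between the query and the document


@dataclass
class InteractionRecord:
    user_hash: str                      # pseudonymized user identifier
    question: str                       # user query
    answer: str                         # chatbot response
    retrieved: List[RetrievedDocument] = field(default_factory=list)
    upvotes: int = 0
    downvotes: int = 0
    feedback: Optional[str] = None      # free-text user feedback, if any


def retrieve(query_vec: List[float],
             corpus: Dict[str, List[float]],
             top_k: int = 2) -> List[RetrievedDocument]:
    """Score every document against the query and keep the top_k matches."""
    scored = [
        RetrievedDocument(doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in corpus.items()
    ]
    return sorted(scored, key=lambda d: d.similarity, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy embeddings; a real system would use a learned embedding model.
    corpus = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
    docs = retrieve([1.0, 0.0], corpus, top_k=2)
    record = InteractionRecord(
        user_hash=pseudonymize("U123", salt="example-salt"),
        question="Example question",
        answer="Example answer",
        retrieved=docs,
    )
    record.upvotes += 1  # a user reacts with an upvote
    print(asdict(record))
```

Storing the retrieved documents and their scores next to each question and answer would let later analyses relate user votes to retrieval quality, which is the kind of dynamic evaluation the abstract describes.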
