AstroLLaVA: towards the unification of astronomical data and natural language

Sharaf Zaman, Michael J. Smith, Pranav Khetarpal, Rishabh Chakrabarty, Michele Ginolfi, Marc Huertas-Company, Maja Jabłońska, Sandor Kruk, Matthieu Le Lain, Sergio José Rodríguez Méndez, Dimitrios Tanoglidis

TL;DR

AstroLLaVA is a domain-specialized vision-language model for astronomy that adapts the LLaVA framework to ground imagery in astronomical knowledge. It uses a two-stage fine-tuning regime and a diverse outreach-focused dataset (APOD, ESO, HST) plus synthetic QA data to enable captioning and visual question answering in the astronomy domain. The authors release the model, code, and training data to foster open-source development, and outline a roadmap for expanding to multimodal astronomical data and broader collaboration. The work demonstrates a viable path toward multimodal foundation models for astronomy, with practical implications for data accessibility and future scientific workflows, while highlighting current limitations and directions for improving visual encoders specialized for scientific imagery.

Abstract

We present AstroLLaVA, a vision language model for astronomy that enables interaction with astronomical imagery through natural dialogue. By fine-tuning the LLaVA model on a diverse dataset of $\sim$30k images with captions and question-answer pairs sourced from NASA's 'Astronomy Picture of the Day', the European Southern Observatory, and the NASA/ESA Hubble Space Telescope, we create a model capable of answering open-ended questions about astronomical concepts depicted visually. Our two-stage fine-tuning process adapts the model to both image captioning and visual question answering in the astronomy domain. We demonstrate AstroLLaVA's performance on an astronomical visual question answering benchmark and release the model weights, code, and training set to encourage further open-source work in this space. Finally, we suggest a roadmap towards general astronomical data alignment with pre-trained language models, and provide an open space for collaboration towards this end for interested researchers.

Paper Structure

This paper contains 10 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Diagram of our AstroLLaVA model at inference time. In this study, $f_\phi$ is LLaMA 7B (Touvron et al., 2023) and $g_\phi$ is CLIP ViT-L/14 (Radford et al., 2021).
  • Figure 2: Example prompt to generate a conversation for the APOD published on 2022-05-01, titled 'First Horizon-Scale Image of a Black Hole'. Similar prompts are used for all our datasets.
  • Figure 3: Exemplar galaxy images for each class in the 'Galaxy 10 DECaLS' dataset. This figure is taken from https://github.com/henrysky/Galaxy10.