LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Gokce Inal, Pouyan Navard, Alper Yilmaz

Abstract

Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k of those images. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070 that exceeds the judge's own reference score. These results highlight the effectiveness of domain-specific multimodal data and instruction tuning in advancing VLMs for planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.
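The 1.070 reasoning score described as exceeding the judge's own reference suggests a relative metric: the judge grades both the model's answer and a reference answer, and the reported value is their ratio, so scores above 1.0 mean the model's answer was rated higher than the reference. The sketch below only illustrates that interpretation; the grading scale and function names are hypothetical, not the paper's evaluation code.

```python
# Hypothetical sketch of judge-based relative scoring (not the paper's evaluation code).
# `judge_grade` stands in for a GPT/Gemini call returning a numeric quality rating.
def relative_score(judge_grade, question, model_answer, reference_answer):
    model_rating = judge_grade(question, model_answer)          # e.g. 7.5
    reference_rating = judge_grade(question, reference_answer)  # e.g. 7.0
    # A ratio above 1.0 (e.g. 1.070) means the judge rated the model's answer
    # above its own reference answer for that question.
    return model_rating / reference_rating
```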

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 2: An example from the LUCID dataset: the top shows the panchromatic image with its associated caption generated using GPT, while the bottom shows question-answer pairs generated from that caption.
  • Figure 3: Distribution of the LUCID VQA dataset across question-answer pair categories. This heatmap shows the percentage contribution of each answer category to each question category across the entire LUCID visual question answering dataset. The heatmap is normalized along the row dimension and reflects the reasoning capability captured in our dataset for answering each question type. (A toy example of this row normalization appears after this list.)
  • Figure 4: Overview of LLaVA-LE Training and Inference. The right panel illustrates the two-stage training pipeline. Visual inputs $X_{\texttt{v}}$ are encoded by a frozen CLIP vision encoder $g$ to produce visual features $Z_{\texttt{v}} = g(X_{\texttt{v}})$, which are then projected into the language embedding space via a trainable projection layer, yielding $H_{\texttt{v}}$ to be concatenated with the language token embeddings $H_{\texttt{q}}$. The language model backbone $f_{\phi}$ autoregressively generates a response $X_{\texttt{a}}$ conditioned on the joint visual and language context. In Stage 1 (Concept Alignment), caption supervision aligns lunar visual features with domain-specific geological language. In Stage 2 (Instruction Tuning), the aligned model is further fine-tuned on multi-turn question-answer pairs to enable reasoning and conversational interaction. Throughout both stages, the CLIP vision encoder and the pretrained language model backbone $f_{\phi}$ remain frozen; only the projection layer and lightweight LoRA adaptation modules are updated. The left panel illustrates a representative inference example after Stage 2, where a user query about lunar surface observations is processed and the model generates a domain-grounded geological interpretation $X_{\texttt{a}}$. (A minimal sketch of this forward pass appears after this list.)
  • Figure 5: Qualitative comparison of LLaVA-LE visual chat and reasoning capabilities on instruction-tuning tasks. Comparison between the responses of the LLaVA-13B model and our LLaVA-LE model during the instruction-tuning stage evaluation. The base LLaVA model often produces fluent but generic answers, occasionally lacking domain-specific terminology or geological interpretation. In some cases, it hallucinates features that are not clearly implied by the scene or shifts toward broad discussions of lunar history rather than addressing the detail in the question. Gemini and GPT are considered the performance upper bound because the ground truth caption is fed into both models as context to generate the ground truth response.
  • Figure 6: Category distribution of the evaluation set: the reasoning category is intentionally emphasized because it poses a harder challenge beyond general visual description.
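
As referenced in the Figure 3 caption, the heatmap is a row-normalized matrix of question categories versus answer categories. The toy sketch below, with made-up category labels and counts, shows one way such a matrix can be computed; it is an illustration under those assumptions, not the dataset's actual statistics.

```python
import pandas as pd

# Hypothetical QA records; the category labels and counts are made up for illustration.
qa = pd.DataFrame({
    "question_category": ["terrain", "terrain", "counting", "reasoning", "reasoning", "reasoning"],
    "answer_category":   ["crater",  "regolith", "number",  "free-form", "crater",    "free-form"],
})

# Count how often each answer category occurs for each question category.
counts = pd.crosstab(qa["question_category"], qa["answer_category"])

# Normalize along the row dimension so each row sums to 100%,
# i.e. the percentage contribution of each answer category per question category.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(1))
```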
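The Figure 4 caption describes a LLaVA-style forward pass: a frozen vision encoder, a trainable projection into the language embedding space, concatenation with the text embeddings, and an autoregressive language model backbone adapted with LoRA. The following is a minimal PyTorch sketch of that data flow; the module choices, dimensions, and the `inputs_embeds` keyword are assumptions about a Hugging Face style interface, not the released LLaVA-LE code.

```python
# Minimal PyTorch sketch of the forward pass described in the Figure 4 caption.
# Module choices, dimensions, and the `inputs_embeds` keyword are assumptions,
# not the released LLaVA-LE implementation.
import torch
import torch.nn as nn

class LunarVLMSketch(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder    # frozen CLIP vision encoder g(.)
        self.language_model = language_model    # pretrained LM backbone f_phi
        # Trainable projection layer mapping visual features into the language embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False             # vision encoder stays frozen in both stages
        for p in self.language_model.parameters():
            p.requires_grad = False             # base LM frozen; LoRA adapters (not shown) stay trainable

    def forward(self, X_v, H_q):
        # Z_v = g(X_v): visual features from the frozen vision encoder.
        with torch.no_grad():
            Z_v = self.vision_encoder(X_v)
        # H_v: visual tokens projected into the language embedding space.
        H_v = self.projector(Z_v)
        # Concatenate visual tokens with the language token embeddings H_q.
        H = torch.cat([H_v, H_q], dim=1)
        # The LM autoregressively predicts the response X_a from the joint context
        # (assuming a Hugging Face style `inputs_embeds` interface).
        return self.language_model(inputs_embeds=H)
```

Per the caption, both stages update only the projection layer and the lightweight LoRA adapters: Stage 1 supervises with captions for concept alignment, and Stage 2 with multi-turn question-answer pairs for instruction tuning.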