Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation

Mariia Drozdova, Erica Lastufka, Vitaliy Kinakh, Taras Holotyak, Daniel Schaerer, Slava Voloshynovskiy

TL;DR

This work investigates whether generic vision-language models (VLMs) can perform morphology-based classification of radio galaxies (FR-I vs FR-II) on MiraBest data, using prompting strategies and lightweight LoRA fine-tuning. It shows that VLMs carry useful priors for unfamiliar scientific imagery, but that their outputs are highly sensitive to prompt design and decoding settings, which reveals fragility in their apparent reasoning. With around 15 million trainable parameters and no astronomy-specific pretraining, LoRA-tuned Qwen-VL rivals domain-specific models. The study highlights both the potential of VLMs for scientific tasks and the caution they require, pointing to careful prompt engineering and targeted adaptation as practical paths forward.

Abstract

Vision-Language Models (VLMs), such as recent Qwen and Gemini models, are positioned as general-purpose AI systems capable of reasoning across domains. Yet their capabilities in scientific imaging, especially on unfamiliar and potentially previously unseen data distributions, remain poorly understood. In this work, we assess whether generic VLMs, presumed to lack exposure to astronomical corpora, can perform morphology-based classification of radio galaxies using the MiraBest FR-I/FR-II dataset. We explore prompting strategies using natural language and schematic diagrams, and, to the best of our knowledge, we are the first to introduce visual in-context examples within prompts in astronomy. Additionally, we evaluate lightweight supervised adaptation via LoRA fine-tuning. Our findings reveal three trends: (i) even prompt-based approaches can achieve good performance, suggesting that VLMs encode useful priors for unfamiliar scientific domains; (ii) however, outputs are highly unstable, varying sharply with superficial prompt changes such as layout, ordering, or decoding temperature, even when semantic content is held constant; and (iii) with just 15M trainable parameters and no astronomy-specific pretraining, fine-tuned Qwen-VL achieves near state-of-the-art performance (3% error rate), rivaling domain-specific models. These results suggest that the apparent "reasoning" of VLMs often reflects prompt sensitivity rather than genuine inference, raising caution for their use in scientific domains. At the same time, with minimal adaptation, generic VLMs can rival specialized models, offering a promising but fragile tool for scientific discovery.
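
To give a feel for why LoRA adaptation stays in the range of millions of trainable parameters, the sketch below counts the parameters that a rank-r update adds to a set of frozen weight matrices. It is only an illustration: the hidden size, number of blocks, number of adapted projections, and rank are hypothetical values chosen for the arithmetic, not the configuration used in the paper, and the helper names are made up for this example.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one frozen d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def total_lora_params(layer_shapes, rank: int) -> int:
    """Sum the LoRA parameters over a list of (d_in, d_out) layer shapes."""
    return sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)


# Hypothetical setup (an assumption, not taken from the paper):
# 32 transformer blocks, 4 adapted square projections per block, hidden size 4096.
hidden = 4096
layer_shapes = [(hidden, hidden)] * (4 * 32)

total = total_lora_params(layer_shapes, rank=8)
full = sum(d_in * d_out for d_in, d_out in layer_shapes)

print(f"LoRA trainable parameters: {total:,}")
print(f"Fraction of the adapted matrices' full size: {total / full:.4f}")
```

Under these assumed sizes the adapters account for well under 1% of the weights they modify, which is the sense in which the adaptation is lightweight.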

Paper Structure

This paper contains 28 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Test error rates across prompting strategies with/without CoT. Boxplots summarize variation across prompts and image placement with respect to the query question. More details in the appendix.
  • Figure 2: Stability of example-conditioned prompts: (a) Error vs. number of retrieved neighbors; lower temperatures reduce error and spread; (b) Error across all permutations: Fixed-Imgs is less stable than kNN-Imgs; (c) kNN-Imgs outperform kNN majority voting by 5 points across train sizes.
  • Figure 3: FR-I vs FR-II radio galaxy morphologies. (a) Schematic illustration of the Fanaroff–Riley classification [alexander_resources]. (b–c) MiraBest radio images.
  • Figure 4: Radar plots showing test accuracy across Text prompts. (a) Image placed after the query; (b) image placed before. Gemini consistently outperforms other models across prompt variants.
  • Figure 5: Radar plots showing test accuracy across Diagram prompts. (a) Image placed after the query; (b) image placed before. Gemini consistently outperforms other models across prompt variants.
  • ...and 3 more figures
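
Figure 2(c) compares example-conditioned prompting (kNN-Imgs) against a kNN majority-voting baseline. As a rough illustration of what such a baseline computes, the sketch below labels a query by the most common class among its k nearest training examples and reports the resulting error rate. The 2-D feature vectors, the Euclidean distance, and the function names are assumptions made for this toy example; the feature extractor and distance used in the paper are not described in this section.

```python
import math
from collections import Counter


def knn_majority_vote(query, train_feats, train_labels, k=3):
    """Predict the label held by the majority of the k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    scored = sorted(
        ((math.dist(query, feat), label) for feat, label in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def error_rate(preds, targets):
    """Fraction of predictions that disagree with the targets."""
    return sum(p != t for p, t in zip(preds, targets)) / len(targets)


# Toy, hand-made features (hypothetical; not MiraBest embeddings).
train_feats = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["FR-I", "FR-I", "FR-I", "FR-II", "FR-II", "FR-II"]

test_feats = [(0.05, 0.1), (1.0, 0.95), (0.2, 0.2)]
test_labels = ["FR-I", "FR-II", "FR-II"]

preds = [knn_majority_vote(q, train_feats, train_labels, k=3) for q in test_feats]
err = error_rate(preds, test_labels)
print(preds, f"error rate = {err:.2f}")
```

An odd k avoids tied votes in the two-class FR-I/FR-II setting. In the kNN-Imgs variant, the retrieved neighbors would instead be placed in the prompt as visual in-context examples, leaving the final decision to the VLM rather than to the vote.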