Toward Vision-Language Assistants for Radio Astronomical Source Analysis

S. Riggi

TL;DR

The paper investigates whether small to mid-sized vision-language models can serve as domain-specific assistants for radio astronomical image analysis. It benchmarks open-weight VLMs and GPT-4.1 in zero-shot settings and introduces radio-llava, a LLaVA-based assistant fine-tuned on radio-domain Q&A and caption data with a frozen vision encoder. Results show GPT-4.1 performs well in some zero-shot tasks but is outperformed by domain-tuned VLMs and, in many cases, by vision-only models specialized to radio data. However, deeper fine-tuning triggers catastrophic forgetting on general multimodal tasks, highlighting a trade-off between specialization and generalization and pointing to data diversity and strategies like LoRA to mitigate these effects.

Abstract

Vision-language models (VLMs) have recently shown promise in general-purpose reasoning tasks, yet their applicability to domain-specific scientific workflows remains largely unexplored. In this work, we evaluated a series of open-weight and commercial VLMs on six tasks relevant to radio astronomy, such as source morphology classification. We also introduced radio-llava, a fine-tuned multimodal assistant built on the LLaVA architecture and adapted for the radio domain through instruction fine-tuning. In zero-shot mode, commercial models like GPT-4.1 outperform open-weight VLMs on most radio benchmarks. However, radio-llava significantly improves upon both base LLaVA and commercial models across nearly all tasks. Despite these gains, specialized vision-only models still deliver substantially better performance across the board. Additionally, we observed that fine-tuning introduces catastrophic forgetting on general multimodal tasks, with performance drops of up to 40% that can be partly mitigated with a more diverse training dataset or shallow fine-tuning.

Paper Structure (4 sections, 3 figures, 1 table)

This paper contains 4 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Classification "macro-averaged" F1-score obtained across tasks B1–B6 in zero-shot mode with open-weight VLMs of different sizes (0.5B, 2B, 3.1B, 7B, 8B, 72B), shown with coloured histograms (LLaVA: blue, TinyLLaVA: green, Qwen2VL: red, InternVL: orange), and OpenAI GPT-4.1 (black histograms).
  • Figure 2: Classification "macro-averaged" F1-score obtained across tasks B1–B6 with the radio-llava model, fine-tuned on the Q&A training dataset (blue histograms) and the combined Q&A and caption datasets (orange histograms) with different training strategies (full vs. LoRA fine-tuning) and depths (shallow vs. deep). Results from different baseline models are also shown: base llava-ov-7b (red histograms), fine-tuned siglip-so400m-patch14-384 vision encoder (green histograms), OpenAI GPT-4.1 (black histograms).
  • Figure 3: Classification accuracy differences between the llava-ov-7b base model and the fine-tuned radio-llava models (blue and orange histograms), evaluated on various multimodal (non-radio) benchmarks.
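
The figure captions above report a "macro-averaged" F1-score, which is the unweighted mean of the per-class F1 scores, so that rare classes count as much as common ones. The following is a minimal sketch of that metric in plain Python, not the paper's evaluation code. The function name `macro_f1` and the class labels in the usage example are hypothetical and chosen only for illustration; classes with no true or predicted samples are assumed to score 0.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels to score")

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Macro average: every class has equal weight regardless of its frequency.
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical morphology labels, for illustration only.
    truth = ["compact", "compact", "extended", "extended"]
    preds = ["compact", "extended", "extended", "extended"]
    print(f"macro-F1 = {macro_f1(truth, preds):.3f}")
```

In this toy example the two classes have F1 scores of 2/3 and 4/5, so the macro average is their plain mean; a frequency-weighted average would generally give a different value when the classes are imbalanced.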