RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer

TL;DR

RadVLM introduces a compact, multitask conversational vision-language model for chest X-ray interpretation, built on a large instruction dataset that covers free-text report generation, abnormality classification, grounding, and multi-turn conversations. Through end-to-end fine-tuning of a VLM backbone, RadVLM achieves state-of-the-art conversational and grounding capabilities while remaining competitive on core radiology tasks. A comprehensive evaluation against re-implemented baselines demonstrates the benefits of joint multi-task training, especially in data-scarce scenarios, and highlights the model's potential as a clinically relevant AI assistant. The work also emphasizes reproducibility by re-implementing baselines under a unified framework and points to future RL-based optimization to further align reasoning and clinical accuracy.

Abstract

The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work, we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings underscore the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

Paper Structure

This paper contains 27 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Examples of single instructions for different tasks. We design three main types of instructions based on dataset attributes. For datasets containing image-report pairs (e.g., MIMIC-CXR), we design question-answer instructions for report generation (a). The instructions for datasets containing abnormality labels (e.g., CheXpert) are designed to perform multi-class classification (b). When bounding boxes are available, we design visual grounding instructions, in which the assistant provides the bounding box coordinates so that they can be displayed on the input image (c,d).
  • Figure 2: Example of LLM-generated conversations within the instruction dataset. LLM-generated user-assistant interactions designed for instruction tuning in RadVLM, covering both standard conversations and grounded responses. (a) Standard conversation: The assistant responds to user queries based on textual attributes extracted from the CXR (e.g., report findings, categorical labels) without explicit spatial references. (b) Conversation with grounding: In addition to textual responses, the assistant provides spatial grounding by referencing anatomical structures with bounding box coordinates. These synthetic interactions are generated by conditioning a text-based LLM on CXR attributes (report, labels, bounding boxes) and prompting it to simulate multi-turn diagnostic dialogues.
  • Figure 3: Instruction fine-tuning of the vision-language model. The CXR image is processed by the vision encoder, and the question is supplied to the language decoder. The flame icons indicate that the vision encoder, adapter, and LLM are all fine-tuned jointly end-to-end; the model generates the answer through next-token prediction.
  • Figure 4: F1 scores for abnormality classification across different models. Classification performance of RadVLM (grey), RaDialog (yellow), and CheXagent (blue). Bars represent the F1 scores for individual pathology categories, while dashed lines indicate the macro-averaged F1 score across all categories. Note that RaDialog was trained exclusively on MIMIC-CXR labels and was therefore evaluated on out-of-domain data.
  • Figure 5: CXR region and abnormality grounding with RadVLM. Examples of RadVLM's grounding predictions for (a) anatomical regions and (b) abnormalities in CXR images. The model predicts bounding boxes indicating the location of queried structures or pathological findings. Green boxes represent ground truth annotations, while red boxes denote model-predicted bounding boxes.
  • ...and 1 more figure
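
The Figure 4 caption refers to per-category F1 scores and their macro average. As a rough illustration of how those two quantities relate, the sketch below computes a binary F1 per category and then takes the unweighted mean. It is not the paper's evaluation code: the function names, the category names, and the toy labels are all assumptions made up for this example.

```python
from typing import Dict, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for one category; 0.0 if there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    true_by_class: Dict[str, List[int]],
    pred_by_class: Dict[str, List[int]],
) -> float:
    """Unweighted mean of per-category F1 scores (every category counts equally)."""
    scores = [f1_score(true_by_class[c], pred_by_class[c]) for c in true_by_class]
    return sum(scores) / len(scores)


# Hypothetical toy labels for four images and two categories (not data from the paper).
true_labels = {
    "cardiomegaly": [1, 0, 1, 1],
    "pleural effusion": [0, 1, 0, 0],
}
pred_labels = {
    "cardiomegaly": [1, 0, 0, 1],
    "pleural effusion": [0, 1, 1, 0],
}

per_class = {c: f1_score(true_labels[c], pred_labels[c]) for c in true_labels}
overall = macro_f1(true_labels, pred_labels)
print(per_class, overall)
```

Because the macro average weights every category equally, a rare pathology with a low F1 pulls the dashed line down as much as a common one would, which is why it is usually reported alongside the per-category bars.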