RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Nicolas Deperrois; Hidetoshi Matsuo; Samuel Ruipérez-Campillo; Moritz Vandenhirtz; Sonia Laguna; Alain Ryser; Koji Fujimoto; Mizuho Nishio; Thomas M. Sutter; Julia E. Vogt; Jonas Kluckert; Thomas Frauenfelder; Christian Blüthgen; Farhad Nooralahzadeh; Michael Krauthammer

TL;DR

RadVLM introduces a compact, multitask conversational vision-language model for chest X-ray interpretation, built on an instruction dataset of over 1 million image-instruction pairs that covers free-text report generation, abnormality classification, visual grounding, and multi-turn conversations. Through end-to-end fine-tuning of a VLM backbone, RadVLM achieves state-of-the-art conversational and grounding capabilities while remaining competitive on core radiology tasks. A comprehensive evaluation against re-implemented baselines demonstrates the benefits of joint multitask training, especially in data-scarce scenarios, and highlights the model's potential as a clinically relevant AI assistant. The work also emphasizes reproducibility by re-implementing baselines under a unified framework and points to future RL-based optimization to further align reasoning and clinical accuracy.

Abstract

The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work, we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs, covering both single-turn tasks (such as report generation, abnormality classification, and visual grounding) and multi-turn, multitask conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks alongside re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
