Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology

Nur Yildirim; Hannah Richardson; Maria T. Wetscherek; Junaid Bajwa; Joseph Jacob; Mark A. Pinnock; Stephen Harris; Daniel Coelho de Castro; Shruthi Bannur; Stephanie L. Hyland; Pratik Ghosh; Mercy Ranjit; Kenza Bouzid; Anton Schwaighofer; Fernando Pérez-García; Harshita Sharma; Ozan Oktay; Matthew Lungren; Javier Alvarez-Valle; Aditya Nori; Anja Thieme

TL;DR

This paper investigates how vision-language models (VLMs) can augment radiology workflows through a human-centered, three-phase design process: brainstorming use cases, sketching concepts, and collecting user feedback. It identifies four clinically relevant use concepts (Draft Report Generation, Augmented Report Review, Visual Search and Querying, and Patient Imaging History Highlights), presents them as prototype sketches, and studies them with 13 radiologists and clinicians to surface design requirements. The study finds that VLMs offer value in information extraction, evidence retrieval, and workflow support, but emphasizes constraints around AI performance, latency, risk, and seamless integration into fast-paced clinical practice. The findings inform practical guidelines for deploying VLMs in radiology and broader healthcare contexts, stressing task-specific tooling, EHR alignment, and human-in-the-loop governance.

Abstract

Recent advances in AI combine large language models (LLMs) with vision encoders that bring forward unprecedented technical capabilities to leverage for a wide range of healthcare applications. Focusing on the domain of radiology, vision-language models (VLMs) achieve good performance results for tasks such as generating radiology findings based on a patient's medical image, or answering visual questions (e.g., 'Where are the nodules in this chest X-ray?'). However, the clinical utility of potential applications of these capabilities is currently underexplored. We engaged in an iterative, multidisciplinary design process to envision clinically relevant VLM interactions, and co-designed four VLM use concepts: Draft Report Generation, Augmented Report Review, Visual Search and Querying, and Patient Imaging History Highlights. We studied these concepts with 13 radiologists and clinicians who assessed the VLM concepts as valuable, yet articulated many design considerations. Reflecting on our findings, we discuss implications for integrating VLM capabilities in radiology, and for healthcare AI more generally.

Paper Structure (43 sections, 9 figures, 1 table)

Introduction
Related Work
VLMs: Multimodal Foundation Models
Human-Centered Medical AI
Designing AI with Domain Stakeholders
Overview of Radiology Workflows
Method
Phase 1: Brainstorming VLM Use Cases
In-depth Discussions
Brainstorming Sessions
Data Collection and Analysis
Phase 2: Sketching VLM Concepts
Phase 3: User Feedback Sessions
Participants
Procedure and Data Analysis
...and 28 more sections

Figures (9)

Figure 1: Overview of the radiology workflow. See Supplementary Material for details on pain points and opportunities. A figure presenting an overview of the radiology workflow and how the work is distributed between clinicians and radiologists. There are seven boxes detailing phases for image request; scan; preview; assignment; reporting; communication; care decision.
Figure 2: The Draft Report Generation (radiologist only) concept displayed (a) a chest X-ray image with patient information and clinical indication, (b) an AI-generated report in bullet point form, and (c) a narrative report created using the bullet points. A prototype displaying the Draft Report Generation concept with a chest x-ray image on the left, bullet point report findings in the middle, and prose report on the right.
Figure 3: The Augmented Report Review (clinician only) concept displayed (a) a report overview feature above the full report, and (b) an AI assistant feature. A prototype displaying the Augmented Report Review concept with a chest X-ray image on the left, a report overview section with bullet point findings in the middle, and an AI assistant section with several prompts on the right. A selected prompt displays a result for the query "What are guidelines for pleural effusion?"
Figure 4: The Visual Search and Querying concept displayed (a) a visual selection tool that enabled image search or image and text queries, and (b) an AI assistant that returned query results without providing an interpretative answer. A prototype displaying the Visual Search and Querying concept with a chest X-ray image on the left that has a selected region, and an AI assistant on the right that returned query results for the selected region, showing similar images diagnosed as anatomic variants versus similar images diagnosed as lump.
Figure 5: The Patient Imaging History Highlights concept displayed (a) a new X-ray scan, (b) prior patient images, and (c) an AI-generated summary of prior images and/or reports. A prototype displaying the Patient Imaging History Highlights concept with a chest x-ray on the right, several past patient images as thumbnails on the left, and an AI-generated short summary at the top.
...and 4 more figures
