MedVersa: A Generalist Foundation Model for Medical Image Interpretation

Hong-Yu Zhou, Julián Nicolás Acosta, Subathra Adithan, Suvrankar Datta, Eric J. Topol, Pranav Rajpurkar

TL;DR

MedVersa introduces a generalist medical foundation model that uses an optimizable LLM as an orchestrator to coordinate diverse vision modules for multimodal medical image interpretation. Trained on tens of millions of medical instances, it achieves state-of-the-art or competitive performance across radiology report generation, vision-centric tasks, and multiple external cohorts. Radiologist and user studies indicate AI-generated reports are often superior or equivalent and can substantially reduce reporting time, highlighting tangible clinical workflow benefits. The work demonstrates the viability of extensible, multimodal generalist AI in clinical radiology and outlines a pathway toward broader modality coverage and continual learning.

Abstract

Current medical AI systems are often limited to narrow applications, hindering widespread adoption. We present MedVersa, a generalist foundation model trained on tens of millions of compiled medical instances. MedVersa unlocks generalist learning from multimodal inputs and outputs, representing the first example of a generalist model reaching competitive performance with leading specialized solutions across a variety of medical imaging scenarios. MedVersa achieves state-of-the-art performance in nine tasks, sometimes outperforming counterparts by over 10%. In radiologist evaluation, MedVersa-generated reports were rated superior in 95% of normal studies and matched or exceeded human reports in 71% of cases overall. User studies showed notable reductions in report writing time and discrepancies with the use of MedVersa. Our findings underscore the value of flexible, multimodal AI systems in advancing medical image interpretation and supporting clinical expertise.

Paper Structure (16 sections, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Study overview. a, The three-stage pipeline showing capability development, model validation, and impact study phases. Each phase builds upon model curation, technical and clinical evaluation, and expert use of AI, respectively. b, Development process that outlines dataset compilation across tasks, architecture design integrating multimodal inputs and outputs, and efficient model training strategies. c, Validation framework demonstrating internal and external validation across multiple cohorts, with emphasis on task improvements through generalist learning and clinical relevance assessment. d, Protocols for comparing human-written and AI-generated reports through blinded assessment by board-certified radiologists to ensure unbiased evaluation. e, Impact study workflow showing the comparative analysis of normal negative template, AI-generated template, and GPT-generated template to assess reduction in report writing time and discrepancies across radiologists.
  • Figure 2: Dataset and model performance. a, Data preprocessing pipeline showing the integration of public datasets and their transformation into varied components including image, context, task prompt, answer text, location, and masks, with a circular chart (values are log-transformed) showing the distribution of different tasks in the dataset. b, Comparisons to state-of-the-art specialized solutions. Relative improvements and p-values are also displayed. c, Performance gains from generalist learning compared with training on vision-language or vision-centric data alone. Chest radiographs were chosen for their diverse analytical tasks.
  • Figure 3: Clinical evaluation. a, Evaluation pipeline showing the blinded assessment process where board-certified radiologists compare randomly shuffled, deidentified human-written and AI-generated reports based on multiview images and clinical context. b, Interface design for radiologist evaluation, including rating options and report categorization. c, Quantitative assessment results showing the distribution of preferences for human versus AI reports across overall, abnormal, and normal cases. d, Protocols for making comments on reports. e, Cases demonstrating the comparison between human-written and AI-generated reports, including image, clinical context, report content, radiologist comments, and ratings.
  • Figure 4: Clinical impact study. a, Study design showing multiple board-certified radiologists (E1-E10) editing reports based on different templates (negative normal, our AI-generated, and GPT-generated) for multiple studies (S1-S6), measuring time spent and discrepancies. b, Matrix visualization of urgent and emergent discrepancies across experts and studies, with average number of discrepancies per expert. c, Comparison of average number of urgent and emergent discrepancies across different report writing time intervals for different template types. d, Detailed template usage patterns across different studies and experts, with average template utilization rates shown. e, Time saved in report writing compared to normal template usage, stratified by top percentage of readers based on average template use.
  • Figure 5: Experimental results of vision-centric tasks. a, MedVersa was compared to three baseline models: DAM (Deep AUC Maximization), MedViT, and BiomedGPT on chest pathology classification. For skin lesion classification, we replaced DAM with CRCKD (Categorical Relation-preserving Contrastive Knowledge Distillation), a model specifically designed for this task. b, We compared MedVersa against YOLO (version five) on two detection tasks: anatomical structure and chest pathology detection. c, For 2D image segmentation, we primarily compared MedVersa with nnUNet2D and nnSAM for segmenting major organs in the chest and skin lesions. d, For 3D image segmentation, MedVersa was compared with nnUNet3D and TransUNet3D for segmenting abdominal organs. e, We investigated the development of a versatile segmentation model for various imaging modalities, extending the functionality of MedVersa. The baseline approach, MedSAM, is a segment anything model finetuned specifically on medical segmentation data. The evaluation metrics of classification, detection, and segmentation tasks are F1, IoU (Intersection over Union), and DICE similarity scores, respectively.
  • ...and 5 more figures
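
The Figure 5 caption names three evaluation metrics: F1 for classification, IoU for detection, and Dice for segmentation. The sketch below shows their textbook definitions in plain Python. It is illustrative only and is not the authors' evaluation code: the function names, the (x1, y1, x2, y2) box format, the flat 0/1 mask format, and the handling of empty inputs are all assumptions, and the captions shown here do not say how scores are averaged across classes or studies.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Assumed convention: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def dice(pred, target):
    """Dice similarity of two binary masks given as flat 0/1 sequences."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * inter / total if total else 1.0
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` compares two 2x2 boxes that overlap in a 1x1 square, giving 1/7. For a single pair of binary masks, Dice and F1 are the same quantity computed over pixels, which is why both reduce to 2·TP / (2·TP + FP + FN).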