Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

Sucheng Ren; Xiaoke Huang; Xianhang Li; Junfei Xiao; Jieru Mei; Zeyu Wang; Alan Yuille; Yuyin Zhou

TL;DR

Medical Vision Generalist (MVG) is presented as the first end-to-end foundation model for medical imaging that unifies segmentation, denoising, inpainting, and cross-modal synthesis across four modalities within a single image-to-image generation framework. MVG standardizes inputs and outputs via in-context coloring and uses task prompts to condition generation, combining masked image modeling with autoregressive training to capture both local detail and global context. Evaluated on a new benchmark of 13 datasets spanning four modalities, MVG consistently outperforms existing vision generalists, scales well with training data, and adapts to unseen datasets with minimal task-specific samples. Although MVG trails specialist models on some benchmarks, its flexibility, in-context adaptability, and demonstrated scalability highlight its potential to accelerate multi-task medical image analysis and reduce the need for task-specific retuning. The work also provides a comprehensive generalist medical vision benchmark to guide future research on medical AI generalists.

Abstract

This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks (such as cross-modal synthesis, image segmentation, denoising, and inpainting) within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of various tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all involved medical imaging tasks. To rigorously evaluate MVG's capabilities, we curated the first comprehensive generalist medical vision benchmark, comprising 13 datasets and spanning four imaging modalities (CT, MRI, X-ray, and micro-ultrasound). Our results consistently establish MVG's superior performance, outperforming existing vision generalists, such as Painter and LVM. Furthermore, MVG exhibits strong scalability, with its performance demonstrably improving when trained on a more diverse set of tasks, and can be effectively adapted to unseen datasets with only minimal task-specific samples. The code is available at https://github.com/OliverRensu/MVG.
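To make the in-context formulation more concrete, the following is a minimal sketch of how a prompt image-label pair and a query image might be assembled into one canvas whose missing region a model then has to generate. It is not taken from the MVG code base: the 2x2 layout, the function names `build_canvas` and `target_patch_mask`, the patch size, and the mask ratio are all assumptions chosen for illustration, and labels are assumed to have already been rendered as images.

```python
import numpy as np


def build_canvas(prompt_img, prompt_lbl, query_img, query_lbl=None):
    """Tile (prompt image | prompt label) above (query image | query label).

    The 2x2 layout is an illustrative assumption. All inputs are H x W x C
    arrays of the same shape. At inference the query label is unknown, so a
    blank placeholder is used in its place.
    """
    if query_lbl is None:
        query_lbl = np.zeros_like(query_img)
    top = np.concatenate([prompt_img, prompt_lbl], axis=1)
    bottom = np.concatenate([query_img, query_lbl], axis=1)
    return np.concatenate([top, bottom], axis=0)


def target_patch_mask(h, w, patch=16, mask_ratio=0.75, rng=None):
    """Return a boolean patch grid for the 2h x 2w canvas (True = hidden).

    Only patches in the bottom-right (query label) quadrant can be hidden.
    patch and mask_ratio are hypothetical values, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    gh, gw = h // patch, w // patch
    n = gh * gw
    k = int(round(mask_ratio * n))
    hidden = np.zeros(n, dtype=bool)
    hidden[rng.choice(n, size=k, replace=False)] = True
    mask = np.zeros((2 * gh, 2 * gw), dtype=bool)
    mask[gh:, gw:] = hidden.reshape(gh, gw)
    return mask


# Toy example with random 64 x 64 RGB arrays standing in for real scans.
rng = np.random.default_rng(0)
prompt_img, prompt_lbl, query_img = (rng.random((64, 64, 3)) for _ in range(3))

canvas = build_canvas(prompt_img, prompt_lbl, query_img)
train_mask = target_patch_mask(64, 64, mask_ratio=0.75, rng=rng)
infer_mask = target_patch_mask(64, 64, mask_ratio=1.0, rng=rng)
```

With `mask_ratio=1.0` the whole query-label quadrant is hidden, which corresponds to the inference setting in which the model must generate the complete output from the prompt pair and the query image. An autoregressive variant would instead order the four tiles as a sequence and predict the final one from the preceding three; that ordering is likewise an assumption here.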

Paper Structure (35 sections, 2 equations, 6 figures, 6 tables)

This paper contains 35 sections, 2 equations, 6 figures, 6 tables.

Introduction
Related Work
Medical Image Analysis.
Universal Models and In-Context Learning.
Method
Tasks
Unifying the Input/Output Space
Task Unification via Conditional Image Generation
Architecture Selection.
Mask Image Modeling.
Auto-Regressive Training.
Loss Function.
Inference.
Experiment
Implementation Details
...and 20 more sections

Figures (6)

Figure 1: Medical Vision Generalist enables a single model to perform four types of medical vision tasks on images from four medical imaging modalities across three major body regions.
Figure 2: Comparison with other generalists. Our model achieves state-of-the-art performance on all involved medical vision tasks of five types.
Figure 3: Method overview. Left: Four types of medical tasks (i.e., segmentation, cross-modal synthesis, inpainting, and denoising) are unified as a universal image-to-image generation task with in-context learning. Right: We adopt mask image modeling and auto-regressive training for in-context generation.
Figure 4: Qualitative evaluation of four tasks: segmentation (1st row), denoising (2nd row), cross-modal synthesis (3rd row), and inpainting (4th row).
Figure 5: Impact of training data scale. We ablate on various scales of the training data (randomly sampled from each dataset), ranging from 1% to 100%.
...and 1 more figures
