MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications
Yongrui Yu, Yannian Gu, Shaoting Zhang, Xiaofan Zhang
TL;DR
MedDiff-FM is a diffusion-based foundation model trained on diverse 3D CT datasets spanning multiple anatomical regions, designed to support a broad range of medical image tasks without region-specific models. It combines image-level and patch-level inputs, 3D position embeddings, and conditioning on both coarse region classes and fine-grained anatomical structures, and uses ControlNet for task-specific fine-tuning. The approach generalizes across synthesis, denoising, and anomaly detection without fine-tuning, and achieves competitive or superior performance on supervised tasks such as volumetric super-resolution and lesion generation after fine-tuning. The work highlights the potential of diffusion foundation models to unify multiple medical imaging tasks and regions, reducing training costs and enabling cross-domain knowledge transfer.
Abstract
Diffusion models have achieved significant success in both natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces MedDiff-FM, a diffusion-based foundation model that addresses a diverse range of medical image tasks. MedDiff-FM is pre-trained on 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, and its capabilities are explored across a variety of application scenarios. The diffusion foundation model integrates image processing at both the image level and the patch level, uses position embedding to establish multi-level spatial relationships, and leverages region classes and anatomical structures to target specific anatomical regions. Without further training, MedDiff-FM handles several downstream tasks, including image denoising, anomaly detection, and image synthesis. It can also perform super-resolution, lesion generation, and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. The experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.
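To make the patch-level input and position embedding more concrete, the following is a minimal sketch of how a 3D patch could be cropped from a CT volume together with a description of where it sits in the full image. It is an illustration only, not the authors' implementation: the function names, the choice of a normalized bounding box as the position signal, and the sinusoidal encoding are all assumptions that the abstract does not specify.

```python
import numpy as np


def extract_patch_with_position(volume, start, patch_size):
    """Crop a 3D patch and return it with its normalized bounding box.

    Hypothetical helper: the bounding-box encoding is an assumption,
    not a detail taken from the paper.
    """
    start = np.asarray(start)
    size = np.asarray(patch_size)
    shape = np.asarray(volume.shape)
    end = start + size
    if np.any(start < 0) or np.any(end > shape):
        raise ValueError("patch must lie inside the volume")
    z, y, x = start
    dz, dy, dx = size
    patch = volume[z:z + dz, y:y + dy, x:x + dx]
    # Start and end corners expressed as fractions of the volume extent,
    # so the same encoding applies to volumes of different sizes.
    position = np.concatenate([start / shape, end / shape]).astype(np.float32)
    return patch, position


def sinusoidal_embedding(position, dim=16):
    """Map each scalar coordinate to `dim` sinusoidal features.

    Assumed encoding (Transformer-style); the paper may use another scheme.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = position[:, None] * freqs[None, :]
    features = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return features.reshape(-1)


# Usage: a dummy volume stands in for a CT scan (depth, height, width).
volume = np.zeros((64, 128, 128), dtype=np.float32)
patch, position = extract_patch_with_position(volume, (16, 32, 64), (32, 64, 64))
embedding = sinusoidal_embedding(position)
```

In this sketch, an image-level input would simply be the case where the patch covers the whole volume, giving a bounding box of zeros for the start corner and ones for the end corner. A network conditioned on such an embedding could, in principle, relate each patch to its location in the body, which is the role the abstract assigns to position embedding.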
