MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications

Yongrui Yu, Yannian Gu, Shaoting Zhang, Xiaofan Zhang

TL;DR

MedDiff-FM is a diffusion-based foundation model pre-trained on diverse 3D CT datasets spanning multiple anatomical regions, so that a single model can serve a broad range of medical image tasks without region-specific models. It processes both image-level and patch-level inputs, uses 3D position embeddings, and conditions on coarse region classes and fine-grained anatomical structures; ControlNet handles task-specific fine-tuning. Without fine-tuning, the model generalizes across synthesis, denoising, and anomaly detection; after fine-tuning, it achieves competitive or superior performance on supervised tasks such as volumetric super-resolution and lesion generation. The work highlights the potential of diffusion foundation models to unify multiple medical imaging tasks and regions, reducing training costs and enabling cross-domain knowledge transfer.

Abstract

Diffusion models have achieved significant success in both natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces MedDiff-FM, a diffusion-based foundation model that addresses a diverse range of medical image tasks. MedDiff-FM leverages 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, to pre-train a diffusion foundation model, and explores the capabilities of that model across a variety of application scenarios. The diffusion foundation model handles multi-level integrated image processing at both the image level and the patch level, utilizes position embedding to establish multi-level spatial relationships, and leverages region classes and anatomical structures to capture certain anatomical regions. MedDiff-FM handles several downstream tasks without fine-tuning, including image denoising, anomaly detection, and image synthesis. It can also perform super-resolution, lesion generation, and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. The experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.

Paper Structure

This paper contains 32 sections, 12 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: An overview of the datasets, anatomical regions, and downstream applications of MedDiff-FM. Covering multiple datasets and diverse anatomical structures, MedDiff-FM supports various downstream tasks with or without fine-tuning.
  • Figure 2: The pre-training and fine-tuning pipelines of MedDiff-FM. MedDiff-FM accommodates multi-level medical image inputs to handle the diversity in CT sizes and spacings, leverages positional embeddings to build multi-level spatial relationships, and utilizes both coarse region conditions and fine-grained anatomical conditions. During fine-tuning, task-specific conditions are incorporated via ControlNet.
  • Figure 3: The multi-level position relationships constructed based on position embedding. Positional relationships are illustrated in the X–Y plane; the Z-axis is handled similarly.
  • Figure 4: The process of patch-level whole-volume synthesis. MedDiff-FM employs a patch-based sliding window sampling strategy, using overlapping windows and smoothed noise estimates to effectively eliminate boundary artifacts.
  • Figure 5: Visual comparison of whole CT volume synthesis results for the HaN, chest, and abdomen regions.
  • ...and 5 more figures
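
The caption of Figure 4 describes patch-level whole-volume synthesis with overlapping sliding windows and smoothed noise estimates. The sketch below illustrates one common way to realize that idea: per-patch noise estimates are accumulated into a volume-sized buffer and divided by the number of windows covering each voxel. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the `predict_noise` callback, and the uniform averaging weights are all hypothetical; the source does not specify how the estimates are smoothed.

```python
import numpy as np


def sliding_window_starts(size, patch, stride):
    """Start indices along one axis so that the windows cover [0, size)."""
    if patch > size:
        raise ValueError("patch must not exceed the volume size")
    if not 0 < stride <= patch:
        raise ValueError("stride must be in (0, patch] to avoid gaps")
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # last window sits flush with the border
    return starts


def blend_patch_estimates(volume_shape, patch_shape, stride, predict_noise):
    """Average overlapping per-patch noise estimates into one 3D estimate.

    predict_noise is a hypothetical callback: it receives a tuple of three
    slices and returns an array of shape patch_shape for that window.
    Uniform averaging is an assumption made for this sketch.
    """
    accum = np.zeros(volume_shape, dtype=np.float64)
    count = np.zeros(volume_shape, dtype=np.float64)
    starts = [
        sliding_window_starts(s, p, st)
        for s, p, st in zip(volume_shape, patch_shape, stride)
    ]
    pz, py, px = patch_shape
    for z in starts[0]:
        for y in starts[1]:
            for x in starts[2]:
                window = (slice(z, z + pz), slice(y, y + py), slice(x, x + px))
                accum[window] += predict_noise(window)
                count[window] += 1.0
    # Every voxel is covered at least once, so count is strictly positive.
    return accum / count
```

In a sampler, `predict_noise` would wrap the denoising network evaluated on the current noisy patch at a given timestep, and the blended estimate would drive one reverse-diffusion step for the whole volume. Replacing the uniform count with a window that decays toward patch borders (for example a Gaussian) is a common variation when seams remain visible.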