MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

Chenlu Zhan, Yu Lin, Gaoang Wang, Hongwei Wang, Jian Wu

TL;DR

MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray), and it performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works.

Abstract

Medical generative models, acknowledged for their high-quality sample generation ability, have accelerated the fast growth of medical applications. However, recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge, constraining medical comprehensive diagnosis. In this paper, we propose MedM2G, a Medical Multi-Modal Generative framework, with the key innovation to align, extract, and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities, we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly, our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal, thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework, our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray). It performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works.

Paper Structure

This paper contains 17 sections, 6 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Our MedM2G on multiple medical generative tasks. By effectively extracting clinical visual knowledge of multiple medical modalities and adopting the latent multi-flow cross-guided diffusion process, MedM2G supports unified medical image-to-text and text-to-image generation, as well as the unified generation of medical modalities (CT, MRI, X-ray).
  • Figure 2: The network structure of MedM2G. (a) The multiple medical modalities are embedded into a unified shared space, with text as the central modality to efficiently align the other modalities. (b) To maintain clinical knowledge, we minimize the off-diagonal elements of the cross-correlation matrix of the two augmented image views. (c) We directly condition the representation as the trainable adaptation to capture the semantic knowledge for generation and adopt the cross-attention sub-layer of one modality to align another.
  • Figure 3: The multi-flow training strategy through 3 rounds of paired training for the multi-modal generation with the central alignment.
  • Figure 4: Qualitative analysis of (a) the medical report generation task, (b) the medical text-image generation task, and (c) unified medical multi-modality generation. Green highlights mark the correctly predicted MeSH terms.
  • Figure 5: Multiple medical modality generation tasks by MedM2G. (a) MRI synthesis task on IXI dataset. (b) MRI-CT transition task on Pelvi dataset. (c) CT-Xray generation task on Chestxray dataset.
  • ...and 2 more figures
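
The caption of Figure 2(b) describes minimizing the off-diagonal elements of the cross-correlation matrix of two augmented image views. The following is a minimal sketch of that one term, not the authors' implementation. It assumes each view has already been encoded into a batch of feature vectors of shape (N, D). The function names, the batch standardization, and the noise-based "augmentation" in the usage lines are all hypothetical choices made for illustration.

```python
import numpy as np


def cross_correlation(z_a, z_b, eps=1e-8):
    """Cross-correlation matrix (D x D) between two embedding batches (N x D).

    Assumption: each feature dimension is standardized over the batch
    before correlating; the source does not specify the normalization.
    """
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    return z_a.T @ z_b / z_a.shape[0]


def off_diagonal_penalty(c):
    """Sum of squared off-diagonal entries of a square matrix."""
    off = c - np.diag(np.diag(c))  # zero out the diagonal
    return float(np.sum(off ** 2))


# Hypothetical usage: two noisy "views" of the same underlying features.
rng = np.random.default_rng(0)
features = rng.normal(size=(128, 16))
view_a = features + 0.1 * rng.normal(size=features.shape)
view_b = features + 0.1 * rng.normal(size=features.shape)

c = cross_correlation(view_a, view_b)
print("off-diagonal penalty:", off_diagonal_penalty(c))
```

Driving this penalty toward zero decorrelates different feature dimensions across the two views, which is one plausible reading of how the visual invariant is preserved. How the term is weighted or combined with other losses is not stated in the caption.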