Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, Yuyin Zhou

TL;DR

MeDiM introduces the first medical discrete diffusion model that learns shared distributions across imaging and text modalities using a Multimodal Large Language Model backbone, enabling medical image generation, report generation, and joint image–report synthesis. It resolves the mismatch between autoregressive MLLMs and non-causal diffusion by removing causal masks, adding continuous timestep embeddings, and applying AdaLN for stable, bidirectional cross-modal denoising. Across MIMIC-CXR and PathGen, MeDiM achieves state-of-the-art or competitive results in image fidelity (FID) and report quality (METEOR, BLEU), and its paired image–report outputs improve downstream vision-language tasks. This work positions MeDiM as a flexible foundation model for unified medical multimodal generation and reasoning, with potential to enable generalist medical AI agents in clinical settings.

Abstract

Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
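
The two key designs named in the abstract, together with the AdaLN conditioning mentioned in the TL;DR, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the sinusoidal form of the timestep embedding, the `(1 + scale)` form of the AdaLN modulation, and every function name below are assumptions chosen for illustration. In a real model the `shift` and `scale` vectors would come from a learned projection of the timestep embedding; here they are passed in directly.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a continuous timestep t (assumed form, dim must be even)."""
    assert dim % 2 == 0, "this sketch only handles even embedding sizes"
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, shift, scale):
    """Adaptive LayerNorm modulation: LN(x) * (1 + scale) + shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def attention_mask(n, causal):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

# An autoregressive backbone uses the lower-triangular (causal) mask;
# dropping it lets every token see the full sequence during denoising.
causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)
```

With zero `shift` and `scale`, `adaln` reduces to a plain layer norm, so the timestep conditioning acts as a perturbation on top of the normalization the backbone already applies.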

Paper Structure

This paper contains 29 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: MeDiM, the first medical discrete diffusion model, is a flexible multimodal generator that simultaneously supports: (i) medical image generation from clinical reports, (ii) report generation from medical images, and (iii) joint synthesis of image–report pairs. Zoom in for a better view.
  • Figure 2: Architectural comparison of medical multimodal models. (“BACKBONE”) indicates the backbone adopted in each framework. Prior approaches (A-D) cannot perform paired generation and suffer from other key limitations, such as requiring modality-specific components (A, B), inference inefficiency (C), or backbone inflexibility (D). In contrast, our model, MeDiM (E), provides a unified framework designed to overcome these challenges.
  • Figure 3: Overview of the MeDiM architecture. The framework integrates an MLLM backbone within a discrete diffusion process for unified medical multimodal generation. During the forward process, data is tokenized and diffused over timesteps. The MLLM is then trained to reverse this process. Key architectural adaptations, including causal attention removal, timestep embeddings, and AdaLN, adapt the autoregressive MLLM for the bidirectional denoising required for unified medical generation.
  • Figure 4: Visual comparison of MeDiM against baselines on three tasks: (A) medical image generation (unique colors indicate the alignment between the reference report and the images generated by MeDiM), (B) medical report generation (the generated report and the reference are highlighted with the same colors for matched content, while incorrect content is highlighted with red underlines), and (C) joint medical image–report pair generation (the generated report and the prompt are highlighted with the same colors for matched content, with green underlines denoting additional correct content consistent with the image, and red underlines marking incorrect content).
  • Figure 5: Quantitative evaluation of MeDiM on the joint medical image–report generation task.
  • ...and 1 more figures
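
The Figure 3 caption says that data is tokenized and diffused over timesteps during the forward process. The exact corruption scheme is not stated in this summary, so the sketch below assumes one common choice for discrete diffusion, absorbing-state masking, in which each token is independently replaced by a mask token with probability equal to the timestep `t`. The names `MASK_ID` and `forward_mask` and the toy token ids are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask (absorbing) token

def forward_mask(tokens, t, rng):
    """Corrupt a token sequence at timestep t in [0, 1].

    Each token is independently replaced by MASK_ID with probability t,
    so t = 0 returns the clean sequence and t = 1 masks everything.
    """
    return [MASK_ID if rng.random() < t else tok for tok in tokens]

# Image and report tokens share one sequence, so a single corruption
# step covers both modalities (toy ids, assumed for illustration).
image_tokens = [101, 102, 103, 104]
text_tokens = [7, 8, 9]
rng = random.Random(0)
noisy = forward_mask(image_tokens + text_tokens, t=0.5, rng=rng)
```

A denoiser trained on such pairs would learn to predict the original token at every masked position given all unmasked tokens, which is why bidirectional attention matters for this setup.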