A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models

Weijie Gan; Xucheng Wang; Tongyao Wang; Wenshang Wang; Chunwei Ying; Yuyang Hu; Yasheng Chen; Hongyu An; Ulugbek S. Kamilov

TL;DR

Any2all is a unified diffusion-based framework that treats multimodal image reconstruction and synthesis as a single virtual inpainting problem. One unconditional denoising diffusion probabilistic model (DDPM) is trained on the complete multimodal data stack, and two task-adaptive samplers, Multimodal Posterior Sampling (MPS) and Multimodal Decomposition Sampling (MDS), are applied at inference to map any available input configuration to all desired modalities. On a PET/MR/CT brain dataset, the approach achieves competitive distortion-based performance and superior perceptual quality across reconstruction and synthesis tasks. The framework could simplify clinical workflows by replacing many task-specific models with one flexible model, at the cost of slower inference, which motivates future work on acceleration. More broadly, the work shows that a single generative prior can serve diverse multimodal imaging tasks, and it highlights the trade-off between perceptual realism and quantitative fidelity in practical deployments.

Abstract

Image reconstruction and image synthesis are important for handling incomplete multimodal imaging data, but existing methods require separate task-specific models, which complicates training and deployment workflows. We introduce Any2all, a unified framework that addresses this limitation by formulating these disparate tasks as a single virtual inpainting problem. We train a single, unconditional diffusion model on the complete multimodal data stack. This model is then adapted at inference time to "inpaint" all target modalities from any combination of available inputs, whether clean images or noisy measurements. We validated Any2all on a PET/MR/CT brain dataset. Our results show that Any2all performs well on both multimodal reconstruction and synthesis tasks, consistently yielding images with competitive distortion-based performance and superior perceptual quality over specialized methods.
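
As a rough illustration of the virtual inpainting idea, the sketch below stacks the modalities along a channel axis, marks which channels are available, and overwrites those channels of an intermediate diffusion sample with noised copies of the known data. This is a generic replacement-style inpainting constraint, not the paper's MPS or MDS sampler, whose details are not given here. The modality order, the function names, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

# Modality names come from the figure captions; their order is an assumption.
MODALITIES = ("T1", "T2FLAIR", "T2*", "CT", "PET")


def build_stack_and_mask(available, shape):
    """Stack modalities along a channel axis and mark which channels are known.

    available: dict mapping modality name -> 2D array of the given shape.
    Returns (stack, mask), each of shape (len(MODALITIES), *shape).
    Missing modalities are left as zeros with a zero mask.
    """
    stack = np.zeros((len(MODALITIES), *shape))
    mask = np.zeros_like(stack)
    for i, name in enumerate(MODALITIES):
        if name in available:
            stack[i] = available[name]
            mask[i] = 1.0
    return stack, mask


def noise_to_level(x0, alpha_bar_t, rng):
    """Standard DDPM forward marginal: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def impose_known_channels(x_t, stack, mask, alpha_bar_t, rng):
    """Overwrite known channels of the sample with noised copies of the data.

    Unknown channels (mask == 0) keep the values produced by the sampler.
    """
    known_t = noise_to_level(stack, alpha_bar_t, rng)
    return mask * known_t + (1.0 - mask) * x_t
```

In a full sampler, a step like `impose_known_channels` would be interleaved with the reverse diffusion updates of the trained unconditional model, so that the missing channels are generated consistently with the observed ones. Handling noisy measurements rather than clean images requires a measurement model in place of the simple mask used here.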

Paper Structure (23 sections, 10 equations, 6 figures, 1 table)

Introduction
Related Work
Multimodal Image Reconstruction
Multimodal Image Synthesis
Method
Problem Formulation
Model Training
Model Inference
Multimodal Posterior Sampling
Multimodal Decomposition Sampling
Numerical Validation
Dataset
Experimental Setup
Comparison Methods
Results
...and 8 more sections

Figures (6)

Figure 1: An illustration of the training and inference pipelines for Any2all. (a, b): Our framework is founded on the concept of a unified virtual inpainting problem, which treats diverse image reconstruction and synthesis tasks as the common goal of restoring missing or corrupted information within a complete set of multimodal images. (c): To solve this, we train a single, unconditional diffusion model that serves as a powerful generative prior. (d, e): During inference, we propose task-adaptive sampling algorithms, namely MPS and MDS, to steer the generative process by enforcing constraints from the available data. This figure shows the ability of Any2all to use a single pre-trained model to map any input data to all clean, multimodal images.
Figure 2: Qualitative and quantitative evaluation of unconditional image generation. (a): Quantitative evaluation shows that Any2all achieves superior perceptual quality, reflected by lower FID scores across all modalities compared to baseline methods. The Any2all results are generated using a standard DDPM sampler from a single, unconditionally trained model. (b): Qualitative comparison of T1 MRI and PET images generated by Any2all and M2DN (a baseline trained on mixed conditional/unconditional tasks). This figure shows that the images from Any2all exhibit finer anatomical details and fewer artifacts, visually confirming the superior quality suggested by the quantitative metrics.
Figure 3: Qualitative and quantitative evaluation of T2FLAIR MRI reconstruction from noisy, 4x undersampled measurements, comparing Any2all with task-specific baselines. (a): Quantitative comparison showing the effect of adding different auxiliary inputs. "y" denotes using only the measurements. "+T1", "+T1&T2*", and "+T1&T2*&CT&PET" indicate the addition of T1 MRI, T1 and T2* MRI, and all other modalities as auxiliary inputs, respectively. (b): Visual results of reconstruction using Any2all (MDS), illustrating how image quality improves with more auxiliary data. (c): Quantitative comparison in the "+T1" setting. Unlike Any2all, the baseline methods were trained specifically for this task. (d): Visual comparison for reconstruction in the "+T1" setting. This figure shows that: (1) reconstruction quality improves as more auxiliary modalities are provided; (2) within our framework, MDS excels on distortion-based metrics while MPS achieves superior perceptual quality; and (3) compared to baselines, Any2all achieves the best perceptual quality (lower FID) while remaining competitive in fidelity.
Figure 4: Qualitative and quantitative evaluation of CT image synthesis, comparing Any2all with several baseline methods designed for multimodal image synthesis. (a): Quantitative results for CT synthesis from different input modalities. Here, "+T1" denotes synthesis from only T1 images, while "+T1&T2*" and "+T1&T2FLAIR&T2*" indicate the addition of T2* MRI and all available MRI modalities as inputs, respectively. (b): Visual results for CT synthesis given all MRI modalities as input (the "+T1&T2FLAIR&T2*" setup). (c): Quantitative results for CT synthesis showcasing Any2all's unique ability to handle mixed inputs (i.e., both raw measurements and clean images). Note that the baseline methods (mmGAN, mmResViT, M2DN) only allow clean images as input. This figure shows that (1) the performance of Any2all progressively improves as more information—either from additional modalities or more complete measurements—is provided, (2) for image synthesis, MPS excels on both distortion-based and perceptual metrics, and (3) compared to baselines, Any2all achieves the best perceptual quality (lower FID) while remaining competitive in fidelity.
Figure 5: Qualitative and quantitative evaluation of PET image synthesis, comparing Any2all with several baseline methods designed for multimodal image synthesis. (a): Quantitative results for PET image synthesis. Here, "+T1" denotes synthesis from only T1 images, while "+T1&T2*" and "+T1&T2FLAIR&T2*" indicate the addition of T2* MRI and all available MRI modalities as inputs, respectively. (b): Visual results from Any2all (MPS), M2DN, and mmResViT for PET synthesis, given either T1 MR images or all MRI modalities as input. This figure shows that (1) the performance of Any2all progressively improves as more information is provided, (2) for image synthesis, MPS excels on both distortion-based and perceptual metrics, and (3) compared to baselines, Any2all achieves the best perceptual quality (lower FID) while remaining competitive in fidelity. Note how MPS generates images with a perceptual quality that consistently matches the ground truth, whereas the baseline methods produce oversmoothed results that lack fine details.
...and 1 more figures
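
The "noisy, 4x undersampled measurements" mentioned in the Figure 3 caption can be pictured with a simple single-coil Cartesian MRI model, y = M(Fx + n), where F is the 2D Fourier transform, M keeps a subset of k-space rows, and n is complex Gaussian noise. The sketch below is a minimal version of that model under stated assumptions: the uniform every-fourth-row pattern, the noise model, and all function names are hypothetical, since the sampling pattern and coil setup used in the paper are not described here.

```python
import numpy as np


def undersample_kspace(image, accel=4, noise_std=0.0, rng=None):
    """Simulate noisy, undersampled k-space measurements of a 2D image.

    Keeps every `accel`-th row of k-space (an assumed sampling pattern)
    and adds complex Gaussian noise before masking.
    Returns (y, mask) where mask is a boolean array of sampled locations.
    """
    rng = np.random.default_rng() if rng is None else rng
    kspace = np.fft.fft2(image, norm="ortho")
    mask = np.zeros(image.shape, dtype=bool)
    mask[::accel, :] = True
    noise = noise_std * (
        rng.standard_normal(image.shape) + 1j * rng.standard_normal(image.shape)
    )
    return mask * (kspace + noise), mask


def zero_filled(y):
    """Naive reconstruction: inverse FFT with unsampled entries left at zero."""
    return np.fft.ifft2(y, norm="ortho")


def data_consistency(x, y, mask):
    """Replace the sampled k-space entries of an estimate x with the data y."""
    k = np.fft.fft2(x, norm="ortho")
    k = np.where(mask, y, k)
    return np.fft.ifft2(k, norm="ortho")
```

With `accel=4`, one quarter of the k-space rows are retained. A reconstruction method then has to fill in the missing three quarters, which is where a generative prior and auxiliary modalities such as T1 can help.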
