Towards General Text-guided Image Synthesis for Customized Multimodal Brain MRI Generation

Yulin Wang; Honglin Xiong; Kaicong Sun; Shuwei Bai; Ling Dai; Zhongxiang Ding; Jiameng Liu; Qian Wang; Qian Liu; Dinggang Shen

TL;DR

TUMSyn can be utilized along with acquired MR scan(s) to facilitate large-scale MRI-based screening and diagnosis of brain diseases and can generate clinically meaningful MR images with specified imaging metadata in supervised and zero-shot scenarios.

Abstract

Multimodal brain magnetic resonance (MR) imaging is indispensable in neuroscience and neurology. However, due to the limited accessibility of MRI scanners and their lengthy acquisition time, multimodal MR images are not commonly available. Current MR image synthesis approaches are typically trained on independent datasets for specific tasks, leading to suboptimal performance when applied to novel datasets and tasks. Here, we present TUMSyn, a Text-guided Universal MR image Synthesis generalist model, which can flexibly generate brain MR images with demanded imaging metadata from routinely acquired scans guided by text prompts. To ensure TUMSyn's image synthesis precision, versatility, and generalizability, we first construct a brain MR database comprising 31,407 3D images with 7 MRI modalities from 13 centers. We then pre-train an MRI-specific text encoder using contrastive learning to effectively control MR image synthesis based on text prompts. Extensive experiments on diverse datasets and physician assessments indicate that TUMSyn can generate clinically meaningful MR images with specified imaging metadata in supervised and zero-shot scenarios. Therefore, TUMSyn can be utilized along with acquired MR scan(s) to facilitate large-scale MRI-based screening and diagnosis of brain diseases.
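The contrastive pre-training mentioned in the abstract can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive loss over paired image and text embeddings. This is not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and a real pipeline would use a deep-learning framework rather than plain Python lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is assumed to be a matching pair.

    The temperature value is an assumption, not a setting reported in the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each image should pick out its own text.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy check: correctly paired embeddings should give a lower loss than swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this form pulls each image embedding toward the embedding of its own text prompt and pushes it away from the other prompts in the batch, which is the general mechanism that lets a text encoder carry image-relevant information.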

Paper Structure (22 sections, 3 equations, 8 figures, 1 table)

This paper contains 22 sections, 3 equations, 8 figures, 1 table.

Results
Discussion
Methods

Figures (8)

Figure 1: Overview of our study. a, Distribution of metadata (dataset and MRI modality) and demographic information (age and gender) along with the number of images for each category in our database. The geographical representation of data distribution is presented below. b, Workflow of pre-training the text encoder, showing the construction of text prompts and the use of CLIP for aligning embeddings from the text encoder with those from the image encoder. c, Workflow of training the image synthesis model. The pretrained text encoder in Fig. 1b is frozen during training in this stage. The CNN encoder, which processes cropped patches instead of the whole volume, is distinct from the image encoder in Fig. 1b and is trained as the image encoder in this stage. Cross-attention is adopted to integrate text embeddings into image embeddings to control image synthesis. The local implicit image function (LIIF) is employed to decode the embeddings and generate images with arbitrary upscaling factors.
Figure 1: Detailed illustration of our model architecture. a, Architecture of the image encoder in BMLIP, which is used for pre-training text encoder. b, Architecture of the CNN encoder in the image synthesis model, which is built on a 24-layer ResNet. c, Architecture of the LIIF-based image decoder in the image synthesis model.
Figure 1: Zero-shot performance of our text encoder on (a) image-to-text and (b) image-to-modality retrieval. Given an image as input, the text whose embedding has the highest cosine similarity with the image embedding is retrieved. For image-to-text retrieval, 10 randomly selected complete text prompts are used, and for image-to-modality retrieval, the 7 image modalities are used as text prompts. We demonstrate several examples for both cases. For all the examples, the retrieval probability is shown as a blue bar. A gold box indicates the ground-truth text prompt for the given image.
Figure 2: Application of TUMSyn in clinical workflow to supplement MRI scanning. a, Integrating TUMSyn into the MR imaging workflow. TUMSyn, with only 114M parameters, can be easily employed to generate unacquired MRI sequences, governed by text prompts, from seven commonly used MRI sequences in clinics across diverse MRI scanners. Exemplar text prompts of target 3D MR images are shown below. The elements in the first curly bracket represent the demanded voxel size of target images, and the elements in the second curly bracket represent the target MR imaging parameters including TR (ms), TE (ms), TI (ms) and FA (degree). The tornado diagram in the bottom left of Fig. 2a shows the execution time of real MRI scans (left) and image synthesis by TUMSyn (right) for five representative MRI sequences, with corresponding imaging parameters listed below each bar. b, Images synthesized by TUMSyn from multi-contrast inputs with diverse scanning orientations and spatial resolutions in multiple data centers.
Figure 3: Quantitative evaluation of synthesis accuracy and versatility of TUMSyn across eight representative MRI synthesis tasks on internal test sets. Comparison with state-of-the-art synthesis models is conducted in terms of PSNR and SSIM, and shown in barplots with mean, median, upper and lower quartiles.
...and 3 more figures
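
The zero-shot retrieval described in the captions above (pick the text prompt whose embedding has the highest cosine similarity with the image embedding, and report a retrieval probability) can be sketched as follows. This is an illustrative sketch only: the three-dimensional embeddings, the modality labels, and the temperature are hypothetical placeholders, and turning similarities into probabilities with a temperature-scaled softmax is an assumption rather than a detail confirmed by the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_prompt(image_emb, prompt_embs, prompts, temperature=0.07):
    """Return the best-matching prompt and a probability for every candidate.

    The temperature-scaled softmax is an assumed way to obtain probabilities.
    """
    sims = [cosine_similarity(image_emb, p) for p in prompt_embs]
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best_index = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best_index], probs


# Hypothetical candidates and toy embeddings, for illustration only.
candidate_prompts = ["T1w", "T2w", "FLAIR"]
candidate_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best, probs = retrieve_prompt([0.9, 0.1, 0.0], candidate_embs, candidate_prompts)
```

In the image-to-modality setting the candidate list would hold one prompt per modality, while in the image-to-text setting it would hold the randomly selected complete text prompts.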
