Unified Cross-Modal Medical Image Synthesis with Hierarchical Mixture of Product-of-Experts

Reuben Dorent, Nazim Haouchine, Alexandra Golby, Sarah Frisken, Tina Kapur, William Wells

TL;DR

This work introduces MMHVAE, a hierarchical mixture of multimodal VAEs that synthesizes missing medical images from partial observations by modeling the variational posterior as a mixture of product-of-experts. The approach constructs a deep, multi-level latent representation and a principled fusion mechanism to align latent factors across modalities, while regularizing non-observed distributions with GAN losses. It enables cross-modal synthesis with incomplete training data and demonstrates strong performance on brain MRI and intraoperative ultrasound tasks, including harmonized synthesis, brain tumor segmentation from synthetic iUS, and improved MR–iUS registration. The method yields sharper, more realistic multimodal reconstructions, better downstream task performance, and favorable computational efficiency relative to contemporary unified synthesis models.

Abstract

We propose a deep mixture of multimodal hierarchical variational auto-encoders called MMHVAE that synthesizes missing images from observed images in different modalities. MMHVAE's design focuses on tackling four challenges: (i) creating a complex latent representation of multimodal data to generate high-resolution images; (ii) encouraging the variational distributions to estimate the missing information needed for cross-modal image synthesis; (iii) learning to fuse multimodal information in the context of missing data; (iv) leveraging dataset-level information to handle incomplete data sets at training time. Extensive experiments are performed on the challenging problem of pre-operative brain multi-parametric magnetic resonance and intra-operative ultrasound imaging.
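
To make the fusion idea in the title concrete, the following is a minimal sketch of a mixture of product-of-experts posterior for a single scalar latent variable. It is not the authors' implementation. It assumes Gaussian experts, a standard normal prior included as an extra expert, and uniform mixture weights over all non-empty subsets of the observed modalities; none of these choices is confirmed by the text above. The function names `product_of_gaussians`, `mixture_of_poe_components`, and `sample_latent` are hypothetical.

```python
import itertools
import random


def product_of_gaussians(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) into a single Gaussian.

    The product of Gaussian densities is Gaussian: precisions add, and the
    fused mean is the precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision


def mixture_of_poe_components(experts, prior=(0.0, 1.0)):
    """Build one product-of-experts component per non-empty modality subset.

    experts: dict mapping an observed modality name to (mean, variance).
    prior:   (mean, variance) of an assumed Gaussian prior expert.
    Missing modalities are simply absent from `experts`.
    """
    names = sorted(experts)
    components = {}
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            means = [prior[0]] + [experts[n][0] for n in subset]
            variances = [prior[1]] + [experts[n][1] for n in subset]
            components[subset] = product_of_gaussians(means, variances)
    return components


def sample_latent(components, rng):
    """Draw from the mixture: pick a subset uniformly (an assumption),
    then sample from that subset's fused Gaussian."""
    subset = rng.choice(sorted(components))
    mean, variance = components[subset]
    return subset, rng.gauss(mean, variance ** 0.5)


# Example: only T2 and iUS are observed; ceT1 and FLAIR are missing.
observed = {"T2": (1.0, 1.0), "iUS": (3.0, 1.0)}
components = mixture_of_poe_components(observed)
subset, z = sample_latent(components, random.Random(0))
```

With two observed modalities the mixture has three components (`T2`, `iUS`, and `T2`+`iUS`). Adding experts to a product increases the total precision, so the component that fuses both modalities is narrower than either single-modality component.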

Paper Structure

This paper contains 39 sections, 32 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Graphical models of: (a) variational auto-encoder (VAE); (b) hierarchical VAE (HVAE); (c) multimodal VAE (MVAE); (d) our mixture of multimodal hierarchical VAEs with missing data. Observed variables are in grey. In our model, variables are not always observed (partially grey).
  • Figure 2: The neural networks implementing (a) the encoder $q(\bm{z}|\bm{x})$ and decoder $p(\bm{x}|z_1)$; (b) the encoder $q(\bm{z}|x_1)$ and decoder $p(\bm{x}|z_1)$ for a hierarchical VAE with $L=3$ groups and $M=2$ modalities.
  • Figure 3: Qualitative comparison of our method with all competing methods for synthesizing all modalities (iUS, T2, ceT1, FLAIR) from (a) iUS; (b) T2. Our approach generates sharper images with better contrast differentiation between tissues and modality-specific patterns (e.g. speckles for iUS).
  • Figure 4: Principal Component Analysis on the first three components (PC1: $39\%$, PC2: $16\%$, PC3: $13\%$) of the latent variable at the highest level $z_1$ estimated using (a) iUS+T2+ceT1+FLAIR; (b) iUS; (c) T2; (d) ceT1; (e) FLAIR as input. Similar representations are obtained for all combinations, in particular in the tumor region (red) and around the ventricle (blue).
  • Figure 5: Impact of the temperature $T$ on the quality of the reconstructed images for (a) iUS to T2; (b) iUS to ceT1 synthesis.
  • ...and 1 more figures
