MeDUET: Disentangled Unified Pretraining for 3D Medical Image Synthesis and Analysis

Junkai Liu, Ling Shao, Le Zhang

TL;DR

MeDUET is a disentangled, unified pretraining framework for 3D medical imaging that learns domain-invariant content and domain-specific style factors directly in the VAE latent space. Two proxy tasks, Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC), reinforce factor identifiability and support robust cross-domain performance. For synthesis, the pretrained model converges faster and reaches higher fidelity; for analysis, it shows strong domain generalization and data efficiency across diverse benchmarks. Empirical results across five datasets demonstrate state-of-the-art performance in both generative and analytical tasks, with evidence of transferability to unseen domains and modalities. The approach highlights the practical value of factor-disentangled latent representations for unified pretraining in 3D medical imaging.

Abstract

Self-supervised learning (SSL) and diffusion models have advanced representation learning and image synthesis. However, in 3D medical imaging, they remain separate: diffusion for synthesis, SSL for analysis. Unifying 3D medical image synthesis and analysis is intuitive yet challenging: multi-center datasets exhibit dominant style shifts while downstream tasks rely on anatomy, and site-specific style co-varies with anatomy across slices, making factors unreliable without explicit constraints. In this paper, we propose MeDUET, a 3D Medical image Disentangled UnifiEd PreTraining framework that performs SSL in the Variational Autoencoder (VAE) latent space and explicitly disentangles domain-invariant content from domain-specific style. The token demixing mechanism turns disentanglement from a modeling assumption into an empirically identifiable property. Two novel proxy tasks, Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC), are devised to synergistically enhance disentanglement. Once pretrained, MeDUET (i) delivers higher fidelity, faster convergence, and improved controllability for synthesis, and (ii) demonstrates strong domain generalization and notable label efficiency for analysis across diverse medical benchmarks. In summary, MeDUET converts multi-source heterogeneity from an obstacle into a learning signal, enabling unified pretraining for 3D medical image synthesis and analysis. The code is available at https://github.com/JK-Liu7/MeDUET.

Paper Structure (33 sections, 28 equations, 10 figures, 15 tables, 1 algorithm)

Figures (10)

  • Figure 1: The motivation of our proposed MeDUET. (a) Latent similarity heatmap across domains ($S_{inter}$ / $S_{intra}$: inter-/intra-domain similarity). Compared with baseline medical SSL, which exhibits site-driven feature blocks, MeDUET isolates domain shifts within the style map while maintaining a uniformly consistent content map across domains. (b) Latent t-SNE colored by domain. The baseline SSL clusters embeddings primarily by domain rather than anatomy, indicating style-dominated representations, whereas MeDUET disentangles content and style in separate embedding spaces, enhancing factor identifiability.
  • Figure 2: (a) Comparison between existing medical image synthesis/analysis paradigms and our unified strategy. (b) Overview of our proposed MeDUET.
  • Figure 3: The overall framework of MeDUET. (a) Mixing & Dual Reconstruction performs demixing between two mixed patch tokens. (b) The Factor Disentanglement module explicitly decomposes encoded latent patches into content and style representations, using a domain classifier to make them domain-invariant and domain-identifiable, respectively. (c) MFTD performs knowledge distillation on the mixed regions within the factor space. (d) SiQC enforces contrastive consistency within the factor space, encouraging invariance to the swapped factor while preserving the discriminability of the retained one.
  • Figure 4: Qualitative comparison of synthesized volumes.
  • Figure 5: Convergence speed comparison. Left: Convergence acceleration for DiT. Right: Convergence acceleration for SiT. $\dagger$: using pre-defined metadata. $\ddagger$: using learned content and style.
  • ...and 5 more figures
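
The inter-/intra-domain similarity scores named in the Figure 1 caption ($S_{inter}$ / $S_{intra}$) can be illustrated with a small sketch. The caption does not say how the scores are computed, so this example assumes mean pairwise cosine similarity over one pooled embedding per volume; the function names `cosine` and `domain_similarity` and the toy vectors are hypothetical and are not taken from the MeDUET code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def domain_similarity(embeddings, domains):
    """Return (s_intra, s_inter).

    s_intra: mean cosine similarity over pairs from the same domain.
    s_inter: mean cosine similarity over pairs from different domains.
    Assumes at least one pair of each kind exists.
    """
    intra, inter = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(embeddings[i], embeddings[j])
            if domains[i] == domains[j]:
                intra.append(sim)
            else:
                inter.append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy, hand-made "style-like" embeddings (illustrative only, not paper data):
# samples from the same site point in similar directions.
style_like = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
domains = ["site_A", "site_A", "site_B", "site_B"]

s_intra, s_inter = domain_similarity(style_like, domains)
print(f"S_intra={s_intra:.3f}  S_inter={s_inter:.3f}")
```

Under this assumed definition, a style-dominated representation shows a large gap between the two scores, while a domain-invariant content representation would be expected to give similar values for both.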