Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel

TL;DR

This work tackles the lack of high-resolution 3D, language-guided counterfactual medical image generation by introducing a native-3D diffusion framework conditioned via BiomedCLIP, augmented with Simple Diffusion techniques and cross-attention. It compares a voxel-space, wavelet-based diffusion model (WDM) with a 3D latent diffusion baseline, employs a Rectified Flow (MAISI RFlow) noise schedule to improve fidelity and efficiency, and uses classifier-free guidance to balance prompt fidelity against anatomical preservation. The method is validated on Multiple Sclerosis (MS) and ADNI brain MRI datasets, where it generates realistic 3D counterfactuals with varying lesion loads and cognitive states while preserving subject identity via shared initial noise $x_T \sim \mathcal{N}(0,I)$. Importantly, MAISI RFlow reduces memory and compute by about 65% while delivering performance close to WDM, supporting rapid experimentation and potential clinical translation for personalized disease modeling and 3D medical education.
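
To make the role of classifier-free guidance concrete, the following is a minimal sketch of one common formulation, in which the unconditional and text-conditional predictions are blended with a guidance scale. The function name `classifier_free_guidance`, the flat-list representation of a volume, and the specific blending formula are assumptions for illustration; the summary does not state which variant or scale the authors use.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and text-conditional model predictions.

    Uses the common form eps_uncond + w * (eps_cond - eps_uncond).
    This form is an assumption here, not taken from the paper:
      w = 0  -> purely unconditional prediction
      w = 1  -> purely conditional prediction
      w > 1  -> extrapolates further toward the text prompt
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same number of voxels")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for a 3-voxel "volume" (illustrative values only).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)  # [2.0, 1.0, 1.0]
```

Under this formulation, a larger scale pushes the sample toward the prompt, while a smaller scale stays closer to what the model would generate without text, which is one way to read the trade-off between prompt fidelity and anatomical preservation described above.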

Abstract

Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the success of these models is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained models do not exist for 3D, significantly limiting progress. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language remains unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression, and enhanced medical training by visualizing hypothetical conditions in realistic detail. Our work takes a step toward this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging, where faithful three-dimensional modeling is essential. On two neurological MRI datasets, our framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity. Our results lay the groundwork for prompt-driven disease progression analysis in 3D medical imaging. Project link - https://lesupermomo.github.io/imagining-alternatives/.

Paper Structure

This paper contains 12 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Proposed Framework. A pretrained BiomedCLIP text encoder encodes the text prompt (e.g. "Subject has high lesion load") as conditioning for the diffusion model. During inference, the model generates counterfactuals by sampling from the same fixed noise while varying the text condition.
  • Figure 2: Qualitative comparison of generated counterfactuals for synthesized subjects on the MS dataset for different lesion loads.
  • Figure 3: Qualitative comparison of generated counterfactuals for synthesized subjects on the ADNI dataset for different cognitive states.
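
The inference procedure described in the Figure 1 caption (one fixed noise sample, several text conditions) can be sketched as below. This is a toy illustration only: `toy_sampler` is a hypothetical deterministic stand-in for a conditional diffusion sampler, `PROMPT_SHIFTS` is a hypothetical replacement for BiomedCLIP text embeddings, and the "low lesion load" prompt is an assumed counterpart to the example prompt in the caption. None of these names come from the paper.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical stand-in for text embeddings: each prompt maps to one scalar.
PROMPT_SHIFTS: Dict[str, float] = {
    "Subject has low lesion load": -1.0,   # assumed prompt, for illustration
    "Subject has high lesion load": 1.0,   # example prompt from the caption
}


def sample_noise(num_voxels: int, seed: int) -> List[float]:
    """Draw x_T ~ N(0, I) reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_voxels)]


def toy_sampler(x_T: Sequence[float], shift: float, num_steps: int = 10) -> List[float]:
    """Toy deterministic 'denoiser': pulls every voxel toward the condition."""
    x = list(x_T)
    for _ in range(num_steps):
        x = [0.9 * v + 0.1 * shift for v in x]
    return x


def generate_counterfactuals(
    prompts: Sequence[str], seed: int, num_voxels: int = 8
) -> Dict[str, List[float]]:
    """Reuse one noise sample for every prompt so that only the text varies."""
    x_T = sample_noise(num_voxels, seed)
    return {p: toy_sampler(x_T, PROMPT_SHIFTS[p]) for p in prompts}


counterfactuals = generate_counterfactuals(list(PROMPT_SHIFTS), seed=0)
for prompt, volume in counterfactuals.items():
    print(prompt, [round(v, 3) for v in volume[:3]])
```

Because every prompt starts from the same `x_T`, the outputs in this toy differ only through the condition, which mirrors the idea of holding the synthesized subject fixed while varying the prompt. A real sampler would replace `toy_sampler` with the trained 3D diffusion model.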