Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation
Zahra TehraniNasab, Amar Kumar, Tal Arbel
TL;DR
This work tackles the challenge of disentangling factors of variation in medical image generation through language-guided latent traversal in a fine-tuned Stable Diffusion model conditioned via CLIP. The authors introduce a two-stage approach: first fine-tune Stable Diffusion on medical data, then traverse the latent space along attribute-directed trajectories by swapping text embeddings during reverse diffusion, using Bézier interpolation to sample smoothly between points. A new CFRT metric and the cosine similarity between latent directions assess attribute disentanglement and trajectory nonlinearity, with qualitative and quantitative validation on CheXpert and ISIC. The study shows that vision-language foundation models can provide explainable, controllable synthesis in medical imaging, offering potential for safer data augmentation and causal analysis while highlighting avenues for structured latent-space design.
Abstract
Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for explaining disease and exploring causal relationships. However, the potential of these models for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin lesion datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.
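
As a rough illustration of two ingredients named in the TL;DR, Bézier interpolation between latent points and cosine similarity between latent directions, the following is a minimal sketch on toy 2-D vectors. It is not the authors' implementation: the function names (`bezier_point`, `bezier_path`, `step_directions`, `cosine_similarity`) are hypothetical, the control points stand in for Stable Diffusion latents that would in practice be high-dimensional, and comparing the first and last step directions is only one assumed way to expose trajectory nonlinearity.

```python
import math
from typing import List, Sequence

Vector = List[float]


def bezier_point(control_points: Sequence[Sequence[float]], t: float) -> Vector:
    """Evaluate a Bezier curve at t in [0, 1] using De Casteljau's algorithm."""
    if not control_points:
        raise ValueError("need at least one control point")
    points = [list(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            [(1.0 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(points[:-1], points[1:])
        ]
    return points[0]


def bezier_path(control_points: Sequence[Sequence[float]], num_samples: int) -> List[Vector]:
    """Sample num_samples evenly spaced (in t) points along the curve."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    return [
        bezier_point(control_points, i / (num_samples - 1))
        for i in range(num_samples)
    ]


def step_directions(path: Sequence[Sequence[float]]) -> List[Vector]:
    """Difference vectors between consecutive points on a sampled path."""
    return [[b - a for a, b in zip(p, q)] for p, q in zip(path[:-1], path[1:])]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy stand-ins for latent points (assumption: real latents are much larger).
curved = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]    # middle point bends the path
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # collinear control points

curved_dirs = step_directions(bezier_path(curved, 5))
straight_dirs = step_directions(bezier_path(straight, 5))

# Similarity near 1 suggests a nearly linear path; lower values suggest bending.
curved_similarity = cosine_similarity(curved_dirs[0], curved_dirs[-1])
straight_similarity = cosine_similarity(straight_dirs[0], straight_dirs[-1])
```

In this toy setting the collinear control points give a similarity of 1 between the first and last steps, while the bent path gives a clearly lower value. The same comparison could in principle be applied to flattened latent tensors sampled along a trajectory.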
