Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation

Zahra TehraniNasab, Amar Kumar, Tal Arbel

TL;DR

This work tackles factor disentanglement in medical image generation through language-guided latent traversal in a fine-tuned Stable Diffusion model conditioned on CLIP text embeddings. The authors introduce a two-stage approach: first fine-tune Stable Diffusion on medical data, then traverse the latent space along attribute-directed trajectories by swapping text embeddings during reverse diffusion, using Bézier interpolation to sample smoothly between points. A new CFRT metric and the cosine similarity between latent directions assess attribute disentanglement and trajectory nonlinearity; both are demonstrated on CheXpert and ISIC with qualitative and quantitative validation. The study shows that vision-language foundation models can provide explainable, controllable synthesis in medical imaging, offering potential for safer data augmentation and causal analysis while highlighting avenues for structured latent-space design.
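As a rough illustration of two ingredients named above, the sketch below samples points along a Bézier curve between latent vectors and compares two latent directions by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the quadratic degree, the choice of control points, the 16-dimensional random vectors, and the names `quadratic_bezier`, `cosine_similarity`, `z_start`, `z_mid`, `z_attr_a`, and `z_attr_b` are all hypothetical.

```python
import numpy as np


def quadratic_bezier(p0, p1, p2, num_points):
    """Sample num_points on the quadratic Bezier curve with control points p0, p1, p2."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]  # column of curve parameters in [0, 1]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical flattened latents; real Stable Diffusion latents are much larger.
rng = np.random.default_rng(0)
z_start = rng.normal(size=16)   # stand-in for a "neutral" latent
z_mid = rng.normal(size=16)     # stand-in for an intermediate point on a trajectory
z_attr_a = rng.normal(size=16)  # stand-in for the end of one attribute trajectory
z_attr_b = rng.normal(size=16)  # stand-in for the end of another attribute trajectory

# Smoothly sampled points from the neutral latent toward attribute A.
curve = quadratic_bezier(z_start, z_mid, z_attr_a, num_points=5)

# Compare the directions of two attribute trajectories from the same start point.
direction_a = z_attr_a - z_start
direction_b = z_attr_b - z_start
similarity = cosine_similarity(direction_a, direction_b)
print(curve.shape, similarity)
```

The curve starts at `z_start` and ends at `z_attr_a`, with `z_mid` bending the path. Under this reading, a cosine similarity near zero would suggest that two attribute directions are close to orthogonal.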

Abstract

Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for the explainability of disease or exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Traversal along the latent trajectories of Stable Diffusion using language guidance. Given an initial chest X-ray projected onto latent space (start point), traversal along the trajectory is performed via language guidance. Sampling along the trajectory results in only a single attribute (e.g. "medical devices", "pleural effusion") being altered from the start point ("neutral"), while the patient identity is maintained.
  • Figure 2: Reverse diffusion for synthesizing disentangled images from text prompts using fine-tuned Stable Diffusion. The reverse diffusion process takes noisy latents, $z_T$, as input. A U-Net architecture generates the denoised latent, $z_0$. During the denoising process, the text embeddings, $e_t$ ($t \in \{0, \dots, T\}$), from the pre-trained CLIP encoder are added to the latent via cross-attention modules (Rombach et al., 2022). Finally, the denoised latent is passed through the decoder to create the synthesized image. Note that the text embeddings can be replaced at some intermediate timestep $t$ during the reverse diffusion.
  • Figure 3: Disentanglement property of Stable Diffusion (Wu et al., 2023). Starting from Gaussian noise (left image) at sampling timepoint $t=T$, the reverse diffusion process denoises the image (right) at timepoint $t=0$. The text prompts for the "neutral" images (with dark borders) for CheXpert and ISIC are "Chest x-ray with no significant findings" and "A dermoscopic image with melanocytic nevus (NV)", respectively. The images on the right (matched with coloured borders) are synthesized with the text prompts "Chest x-ray showing Support Devices" for CheXpert and "A dermoscopic image with melanocytic nevus (NV) showing ink" for ISIC, sampled at different timesteps during the reverse diffusion process. Notice that sampling closer to timepoint $t=0$ results in a synthesized image similar to the original image, whereas sampling closer to timepoint $t=T$ changes the patient's anatomical structure.
  • Figure 4: t-SNE plot of generated latent vectors of Stable Diffusion sampled from noise, showing disentanglement. The dots and the images with borders show the resulting Stable Diffusion latent vectors and their corresponding "neutral" images, generated with the text prompt "Normal chest x-ray with no significant findings". For each generated sample, we swap the original text condition to "Chest x-ray showing Support Devices" for one trajectory and to "Chest x-ray showing Pleural Effusion" for a different trajectory at multiple denoising steps during reverse diffusion.
  • Figure 5: Cosine similarity for all the attributes.
  • ...and 1 more figure
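The captions for Figures 2 to 4 describe replacing the text embedding at an intermediate timestep of reverse diffusion. The toy loop below sketches only that control flow. It is not Stable Diffusion: the "denoising step" is a hypothetical stand-in that pulls the latent toward the conditioning vector, whereas the real model feeds CLIP embeddings into a U-Net through cross-attention. The names `reverse_diffusion_with_swap`, `e_neutral`, `e_attr`, `swap_t`, and `rate` are assumptions made for illustration.

```python
import numpy as np


def reverse_diffusion_with_swap(z_T, e_neutral, e_attr, num_steps, swap_t, rate=0.1):
    """Toy reverse process: condition on e_neutral for t > swap_t, then on e_attr.

    swap_t = 0 never swaps; swap_t = num_steps uses e_attr from the first step.
    """
    z = z_T.copy()
    for t in range(num_steps, 0, -1):  # t = T, T-1, ..., 1
        cond = e_attr if t <= swap_t else e_neutral
        # Hypothetical stand-in for one U-Net denoising step.
        z = z + rate * (cond - z)
    return z


rng = np.random.default_rng(0)
z_T = rng.normal(size=8)        # noisy starting latent
e_neutral = rng.normal(size=8)  # stand-in for the "neutral" prompt embedding
e_attr = rng.normal(size=8)     # stand-in for the attribute prompt embedding

num_steps = 30
z_neutral = reverse_diffusion_with_swap(z_T, e_neutral, e_attr, num_steps, swap_t=0)

# Distance from the neutral result when swapping at different timesteps.
swap_points = (0, 5, 20, 30)
distances = [
    float(np.linalg.norm(
        reverse_diffusion_with_swap(z_T, e_neutral, e_attr, num_steps, swap_t=s) - z_neutral
    ))
    for s in swap_points
]
print(distances)
```

In this toy setting, swapping close to $t=0$ leaves the result near the neutral latent, while swapping close to $t=T$ moves it further away, which mirrors the qualitative behaviour described in the Figure 3 caption.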