Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

Xin Huang, Junjie Liang, Qingshan Hou, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane

TL;DR

This paper introduces a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. The resulting method outperforms existing approaches in generation quality and significantly improves performance on downstream classification tasks.

Abstract

Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
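
The disentanglement idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the projector or the alignment objective, so the two linear heads, the cosine-based loss, and every name below (`disentangle`, `alignment_loss`, `W_anat`, `W_style`, the dimensions) are assumptions made purely for illustration. The sketch projects one text embedding into separate anatomy and style vectors and scores how well each agrees with a corresponding visual prior.

```python
import math
import random


def linear(x, W):
    """Bias-free linear map; W holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def random_matrix(rows, cols, rng):
    """Gaussian-initialised weight matrix (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def disentangle(text_emb, W_anat, W_style):
    """Assumed design: two independent heads, one per semantic factor."""
    return linear(text_emb, W_anat), linear(text_emb, W_style)


def alignment_loss(text_emb, anat_prior, style_prior, W_anat, W_style):
    """Assumed objective: cosine distance to each visual prior, summed."""
    z_anat, z_style = disentangle(text_emb, W_anat, W_style)
    return (1.0 - cosine(z_anat, anat_prior)) + (1.0 - cosine(z_style, style_prior))


# Toy data standing in for a text embedding and two visual priors.
rng = random.Random(0)
text_dim, latent_dim = 8, 4
W_anat = random_matrix(latent_dim, text_dim, rng)
W_style = random_matrix(latent_dim, text_dim, rng)
text_emb = [rng.gauss(0.0, 1.0) for _ in range(text_dim)]
anat_prior = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
style_prior = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

loss = alignment_loss(text_emb, anat_prior, style_prior, W_anat, W_style)
print(f"alignment loss: {loss:.4f}")
```

In a trained system the projection weights would be optimised so that this kind of loss decreases, and the two resulting vectors would then be passed to the generator through separate conditioning paths, as the abstract describes for the HFFM. How the paper actually parameterises those steps is not stated here.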

Paper Structure (9 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Impact of text prompts for comparable methods on image synthesis. Anatomical structures and modality styles are highlighted in red and blue, respectively.
  • Figure 2: The overall architecture of our proposed Visually-Guided Text Disentanglement Diffusion Framework. Note that only the lightweight feature projector and the Transformer with LoRA adapters are required in the inference stage.
  • Figure 3: The t-SNE feature visualization before and after modality alignment. The baseline (Left) exhibits a significant modality gap, where text embeddings are isolated from visual features. In contrast, with our alignment module, the text features are closely aligned with the visual priors in the Anatomy (Middle) and Style (Right) subspaces.
  • Figure 4: Visual comparison of image synthesis results.