Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation
Peng Huang, Xue Gao, Lihong Huang, Jing Jiao, Xiaokang Li, Yuanyuan Wang, Yi Guo
TL;DR
The paper addresses the need for realistic, controllable chest X-ray (CXR) generation from medical reports at a reduced computational cost. It introduces Chest-Diffusion, a lightweight diffusion framework that uses the domain-specific CLIP encoder BiomedCLIP to extract report features, runs diffusion in the CXR latent space of a frozen autoencoder, and employs a compact U-ViT-based denoiser that fuses the timestep, the latent, and the textual guidance with no extra modules. Key contributions include domain-aligned text embeddings, latent-space diffusion with a compact transformer denoiser, and state-of-the-art performance on MIMIC-CXR at 118.918 GFLOPs, with an FID of 24.456 and higher AUROC than RoentGen and LLM-CXR. This work advances efficient, high-fidelity report-to-CXR generation, with implications for medical education and data-driven imaging research. The forward diffusion factorizes as $q(x_{1:T}|x_{0})=\prod_{t=1}^{T} q(x_{t}|x_{t-1})$, and the denoiser is trained with the objective $\min_{\theta}\mathbb{E}_{t,x_{0},c,\epsilon}[\|\epsilon-\epsilon_{\theta}(x_{t},t,c)\|_{2}^{2}]$, where $c$ denotes the report features, enabling stable learning of compact generative models in the medical domain.
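The forward process and denoising objective quoted above can be illustrated with a minimal NumPy sketch. It assumes the standard DDPM Gaussian transition $q(x_{t}|x_{t-1})=\mathcal{N}(\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}I)$ with a linear $\beta$ schedule, neither of which is stated in this summary; the names `linear_beta_schedule`, `q_sample`, `denoising_loss`, and `eps_model` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    # Assumed schedule: the summary does not say which schedule the paper uses.
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, eps):
    # Closed-form sample from q(x_t | x_0) under the Gaussian assumption:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoising_loss(eps_model, x0, c, alpha_bar, rng):
    # One Monte Carlo sample of the epsilon-prediction objective.
    # eps_model(x_t, t, c) is a placeholder for the conditional denoiser.
    t = int(rng.integers(0, len(alpha_bar)))   # random timestep (0-indexed)
    eps = rng.standard_normal(x0.shape)        # target noise
    x_t = q_sample(x0, t, alpha_bar, eps)      # noised latent
    eps_hat = eps_model(x_t, t, c)             # predicted noise
    # Mean squared error; equals the squared L2 norm up to a constant factor.
    return float(np.mean((eps - eps_hat) ** 2))

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)            # cumulative signal retention
```

In a latent diffusion setting such as the one described here, `x0` would be the latent produced by the frozen autoencoder rather than the pixel image, and `c` would be the report features from the text encoder.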
Abstract
Text-to-image generation has important implications for the generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution gap between medical reports and natural texts, together with the high computational complexity of standard Stable Diffusion, limits the authenticity and feasibility of the generated medical images. To solve these problems, we propose Chest-Diffusion, a novel light-weight transformer-based diffusion learning framework for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features that guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that Chest-Diffusion achieves the lowest FID score of 24.456 under a computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.
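One way a transformer denoiser can take in the timestep, the report features, and the noisy latent without extra conditioning modules is to treat all three as tokens in a single sequence, as in the general U-ViT design. The sketch below illustrates only that token assembly. The shapes, the patch size, the random projection matrices, and the function names (`patchify`, `timestep_embedding`, `build_tokens`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def patchify(z, p):
    # Split a latent of shape (C, H, W) into (H/p * W/p) flattened patches.
    C, H, W = z.shape
    assert H % p == 0 and W % p == 0
    z = z.reshape(C, H // p, p, W // p, p)
    z = z.transpose(1, 3, 0, 2, 4)             # (H/p, W/p, C, p, p)
    return z.reshape((H // p) * (W // p), C * p * p)

def timestep_embedding(t, dim):
    # Sinusoidal embedding of a scalar timestep; dim is assumed to be even.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def build_tokens(z_t, t, text_feats, W_patch, W_text, p):
    # Project every input to the same width D and concatenate along the
    # sequence axis: [time token | text tokens | latent patch tokens].
    patch_tokens = patchify(z_t, p) @ W_patch   # (N, D)
    D = W_patch.shape[1]
    time_token = timestep_embedding(t, D)[None, :]  # (1, D)
    text_tokens = text_feats @ W_text           # (L, D)
    return np.concatenate([time_token, text_tokens, patch_tokens], axis=0)

# Hypothetical sizes: 4x32x32 latent, patch size 2, width 64, one 512-d text vector.
rng = np.random.default_rng(0)
z_t = rng.standard_normal((4, 32, 32))
text_feats = rng.standard_normal((1, 512))
W_patch = rng.standard_normal((4 * 2 * 2, 64)) * 0.02
W_text = rng.standard_normal((512, 64)) * 0.02
tokens = build_tokens(z_t, 500, text_feats, W_patch, W_text, p=2)
```

The resulting sequence would then pass through the transformer blocks, and the outputs at the patch positions would be reshaped back into a noise prediction over the latent. Since every input is just a token, no cross-attention or adapter layers are needed, which is one plausible source of the computational savings claimed above.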
