Vision-Language Generative Model for View-Specific Chest X-ray Generation
Hyungyung Lee, Da Young Lee, Wonjae Kim, Jin-Hwa Kim, Tackeun Kim, Jihang Kim, Leonard Sunwoo, Edward Choi
TL;DR
ViewXGen addresses a gap in chest X-ray synthesis, namely that existing generators produce only frontal views, by enabling view-specific generation through dedicated per-view tokens and by conditioning on multiple views from the same study. It builds a unified pipeline from VQ-GAN image tokens, byte-level BPE text tokens, and a Transformer with a multimodal causal mask, implemented with the Performer's FAVOR+ attention so that long sequences remain tractable. The approach produces more realistic and clinically faithful images than fine-tuned Stable Diffusion and retrieval-based baselines, and the unified multi-view model clearly outperforms its single-view counterparts. The authors acknowledge limitations related to report phrasing and fine-grained details, and outline future work on refining the dataset and extending the model to radiology report generation.
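The TL;DR pairs a causal mask with Performer's FAVOR+ attention. The NumPy sketch below shows why these fit together, under stated assumptions: it treats the multimodal mask as a plain lower-triangular mask over the concatenated text and image sequence (the paper's exact mask may differ), and it draws i.i.d. Gaussian projections rather than the orthogonal random features FAVOR+ normally uses. The positive random-feature map approximates the softmax kernel, and running prefix sums over positions j ≤ i apply the causal mask without ever forming the n × n attention matrix. The function names are illustrative only.

```python
import numpy as np

def positive_random_features(x, w):
    """FAVOR+-style positive features: phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m)."""
    m = w.shape[0]
    proj = x @ w.T                                           # (n, m)
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)  # (n, 1)
    return np.exp(proj - half_sq_norm) / np.sqrt(m)

def causal_favor_attention(q, k, v, w):
    """Causal linear attention via prefix sums (no n x n matrix is built)."""
    d = q.shape[-1]
    scale = d ** -0.25                     # splits softmax's 1/sqrt(d) between q and k
    q_prime = positive_random_features(q * scale, w)         # (n, m)
    k_prime = positive_random_features(k * scale, w)         # (n, m)
    # Cumulative sums over positions j <= i implement the lower-triangular mask.
    # (This sketch materializes an (n, m, d_v) tensor; real kernels avoid that.)
    kv_prefix = np.cumsum(k_prime[:, :, None] * v[:, None, :], axis=0)  # (n, m, d_v)
    k_prefix = np.cumsum(k_prime, axis=0)                               # (n, m)
    numerator = np.einsum("nm,nmd->nd", q_prime, kv_prefix)
    denominator = np.einsum("nm,nm->n", q_prime, k_prefix)[:, None]
    return numerator / denominator

def masked_reference(q, k, v, w):
    """Same feature map with an explicit O(n^2) lower-triangular mask, for checking."""
    d = q.shape[-1]
    scale = d ** -0.25
    q_prime = positive_random_features(q * scale, w)
    k_prime = positive_random_features(k * scale, w)
    scores = np.tril(q_prime @ k_prime.T)  # positive scores, future positions zeroed
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, m = 16, 8, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w = rng.standard_normal((m, d))  # i.i.d. draws; FAVOR+ proper orthogonalizes these
out = causal_favor_attention(q, k, v, w)
assert np.allclose(out, masked_reference(q, k, v, w))
```

The prefix-sum form and the explicit masked form are algebraically identical; only the prefix-sum form scales linearly in sequence length, which matters when a single sequence holds a report plus several images' worth of VQ-GAN codes.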
Abstract
Synthetic medical data generation has opened up new possibilities in the healthcare domain, offering a powerful tool for simulating clinical scenarios, enhancing diagnostic and treatment quality, gaining granular medical knowledge, and accelerating the development of unbiased algorithms. In this context, we present ViewXGen, a novel approach designed to overcome the limitations of existing methods, which adapt general-domain pipelines and use only radiology reports to generate frontal-view chest X-rays. Our approach accounts for the diverse view positions found in the dataset, enabling the generation of chest X-rays in a specified view, which marks a significant advancement in the field. To achieve this, we introduce a set of specially designed tokens, one for each view position, so that the user can choose which view is generated. Furthermore, we leverage multi-view chest X-rays as input, incorporating valuable information from the different views within the same study. This integration corrects potential errors and helps faithfully capture abnormal findings in the generated chest X-rays. To validate the effectiveness of our approach, we conducted statistical analyses of its performance on a clinical efficacy metric using the MIMIC-CXR dataset. In addition, human evaluation demonstrates the remarkable capabilities of ViewXGen, particularly in producing realistic view-specific X-rays that closely resemble the original images.
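To make the idea of per-view tokens and multi-view conditioning concrete, here is a minimal sketch of how a conditioning sequence might be assembled. Everything in it is a hypothetical assumption for illustration, not the paper's actual layout: the vocabulary sizes, the special tokens `[BOS]`, `[SEP]`, `[PA]`, `[AP]`, `[LATERAL]`, the ordering (report, then other views of the study, then the target view token), and the helper names `View`, `image_tokens`, and `build_prompt`. Text ids, special tokens, and VQ-GAN codebook indices share one vocabulary by offsetting the image codes.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical vocabulary layout (not taken from the paper):
# [0, TEXT_VOCAB) -> byte-level BPE ids, then special tokens, then VQ-GAN codes.
TEXT_VOCAB = 1000
SPECIAL: Dict[str, int] = {
    "[BOS]": TEXT_VOCAB,
    "[SEP]": TEXT_VOCAB + 1,
    "[PA]": TEXT_VOCAB + 2,
    "[AP]": TEXT_VOCAB + 3,
    "[LATERAL]": TEXT_VOCAB + 4,
}
IMAGE_OFFSET = TEXT_VOCAB + len(SPECIAL)  # first id reserved for image codes
CODEBOOK_SIZE = 1024                      # hypothetical VQ-GAN codebook size

@dataclass
class View:
    position: str      # "PA", "AP" or "LATERAL"
    codes: List[int]   # VQ-GAN codebook indices for one image

def image_tokens(view: View) -> List[int]:
    """One view-position token, then the image codes shifted past the text ids."""
    assert all(0 <= c < CODEBOOK_SIZE for c in view.codes)
    return [SPECIAL[f"[{view.position}]"]] + [IMAGE_OFFSET + c for c in view.codes]

def build_prompt(report_ids: List[int], input_views: List[View],
                 target_position: str) -> List[int]:
    """Report, then other views of the same study, then the target view token."""
    assert all(0 <= t < TEXT_VOCAB for t in report_ids)
    seq = [SPECIAL["[BOS]"]] + list(report_ids) + [SPECIAL["[SEP]"]]
    for view in input_views:
        seq += image_tokens(view)
    # A decoder would autoregressively emit the target view's codes after this token.
    seq.append(SPECIAL[f"[{target_position}]"])
    return seq

prompt = build_prompt([5, 17, 42], [View("PA", [3, 9])], "LATERAL")
# prompt == [1000, 5, 17, 42, 1001, 1002, 1008, 1014, 1004]
```

Ending the prompt with the target view token is what lets a single model serve every view position: swapping `"LATERAL"` for `"AP"` changes only the last token, while the report and the other views supply the shared clinical content.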
