Vision-Language Generative Model for View-Specific Chest X-ray Generation

Hyungyung Lee; Da Young Lee; Wonjae Kim; Jin-Hwa Kim; Tackeun Kim; Jihang Kim; Leonard Sunwoo; Edward Choi

TL;DR

ViewXGen addresses a gap in chest X-ray synthesis by enabling view-specific generation through dedicated per-view tokens and by accepting multi-view chest X-rays as input. Its unified pipeline tokenizes images with VQ-GAN and reports with byte-level BPE, then models the combined sequence with a Transformer under a multimodal causal mask; Performer attention with FAVOR+ keeps the long sequences tractable. The model is reported to outperform fine-tuned Stable Diffusion and retrieval-based baselines in realism and clinical fidelity, and the unified multi-view model shows clear advantages over single-view counterparts. The stated limitations concern report phrasing and fine-grained details; future work includes dataset refinements and extending the model to radiology report generation.
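As a rough illustration of why FAVOR+ helps with long sequences, the sketch below estimates the softmax kernel exp(q · k) with positive random features, which is the core idea behind Performer attention. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the feature count, and the toy vectors are all hypothetical, and the orthogonal random projections used by full FAVOR+ are replaced here with plain independent Gaussian draws.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to phi(x) so that phi(q) . phi(k) estimates exp(q . k)."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in projections
    ]


def approx_softmax_kernel(q, k, num_features=20000, seed=0):
    """Monte Carlo estimate of exp(q . k) from Gaussian random projections.

    Assumption: independent Gaussian draws stand in for the orthogonal
    projections of full FAVOR+.
    """
    rng = random.Random(seed)
    projections = [[rng.gauss(0.0, 1.0) for _ in q] for _ in range(num_features)]
    phi_q = positive_random_features(q, projections)
    phi_k = positive_random_features(k, projections)
    return sum(a * b for a, b in zip(phi_q, phi_k))


# Toy query and key vectors (hypothetical values).
q = [0.3, -0.2]
k = [0.1, 0.4]
exact = math.exp(sum(a * b for a, b in zip(q, k)))
estimate = approx_softmax_kernel(q, k)
```

Because the kernel factorizes into a product of feature maps, attention can be computed without forming the full attention matrix, so cost grows roughly linearly with sequence length. That matters when several tokenized images and a report share one sequence.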

Abstract

Synthetic medical data generation has opened up new possibilities in the healthcare domain, offering a powerful tool for simulating clinical scenarios, enhancing diagnostic and treatment quality, gaining granular medical knowledge, and accelerating the development of unbiased algorithms. In this context, we present a novel approach called ViewXGen, designed to overcome the limitations of existing methods that rely on general domain pipelines using only radiology reports to generate frontal-view chest X-rays. Our approach takes into consideration the diverse view positions found in the dataset, enabling the generation of chest X-rays with specific views, which marks a significant advancement in the field. To achieve this, we introduce a set of specially designed tokens for each view position, tailoring the generation process to the user's preferences. Furthermore, we leverage multi-view chest X-rays as input, incorporating valuable information from different views within the same study. This integration rectifies potential errors and contributes to faithfully capturing abnormal findings in chest X-ray generation. To validate the effectiveness of our approach, we conducted statistical analyses, evaluating its performance on a clinical efficacy metric on the MIMIC-CXR dataset. Human evaluation also demonstrates the capabilities of ViewXGen, particularly in producing realistic view-specific X-rays that closely resemble the original images.

Paper Structure (44 sections, 10 equations, 5 figures, 7 tables)

Introduction
Related Works
Chest X-ray Generation
Image Tokenization
Efficient Transformer
Method
Input Embedding
Image Tokenization
Chest X-ray Embedding
Radiology Report Embedding
Multi-view Chest X-ray Generative Model
Experiments
Dataset
Evaluation Metrics
Statistical Evaluation
...and 29 more sections

Figures (5)

Figure 1: We introduce a view-specific chest X-ray generation model. ViewXGen leverages view-specific special tokens to empower its ability to capture unique features from different views. Additionally, the integration of multi-view chest X-rays as input enhances the overall generation quality.
Figure 2: Overview of the ViewXGen architecture. (a) ViewXGen is designed to generate chest X-rays with specific views, such as AP, PA, and Lateral views. (b) Images are tokenized via VQ-GAN, and reports are tokenized via a byte-level BPE tokenizer. (c) Each minibatch contains input sequences made up of AP/PA/Lateral X-rays and a report in random order. (d) We use a causal attention mask to simultaneously handle multi-view X-rays and a report.
Figure 3: Generated chest X-rays of ViewXGen. (a) Based only on the report, the generated PA in the orange dashed box draws a rather small portion of the consolidation in the lingula, as is written in the report. Based on an additional lateral view, the generated PA in the blue dashed box draws a consolidation that is closer in size to that in the original PA. (b) The generated PA conditioned only on the report (orange dashed box) draws a relatively small pleural effusion, although the report says "large right pleural effusion". However, with an additional lateral view (blue dashed box), ViewXGen can properly generate the PA view with a large pleural effusion.
Figure 4: Generated radiology reports of ViewXGen. (a) Regardless of the number of chest X-rays input, ViewXGen can generate accurate radiology reports covering all diseases mentioned in the original report. (b) The generated report only from a single chest X-ray (orange dashed box) cannot fully capture the abnormalities in the given X-ray. With an additional chest X-ray, ViewXGen can generate a more precise report (blue dashed box) containing all diseases as described in the original report.
Figure 5: These examples highlight the advanced capabilities of our approach to generate images that accurately incorporate details, even those not explicitly stated or omitted in the reports. In contrast, they underline the limitations of a purely retrieval-based approach, which often fails to capture essential patient information such as gender or specific health conditions like obesity, especially when faced with incomplete or erroneous reports. This comparison demonstrates the inadequacy of the retrieval method in handling complex clinical scenarios.
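
The sequence layout and masking described in Figure 2(c) and 2(d) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the boundary-token names (`[SOS_PA]`, `[EOS_PA]`, and so on), the helper names, and the toy token ids are hypothetical, and a plain lower-triangular mask stands in for whatever exact multimodal causal mask the authors use.

```python
import random


def wrap(kind, tokens):
    """Surround a segment with boundary tokens naming its view or modality.

    The token names are hypothetical; the source only states that each
    view position has its own specially designed tokens.
    """
    return [f"[SOS_{kind}]"] + list(tokens) + [f"[EOS_{kind}]"]


def build_sequence(segments, rng):
    """Concatenate (kind, tokens) segments in random order."""
    order = list(segments)
    rng.shuffle(order)
    sequence = []
    for kind, tokens in order:
        sequence.extend(wrap(kind, tokens))
    return sequence


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Toy study: two image views and one report, with made-up token ids.
segments = [
    ("PA", [11, 12, 13]),
    ("LATERAL", [21, 22, 23]),
    ("REPORT", [31, 32]),
]
sequence = build_sequence(segments, random.Random(0))
mask = causal_mask(len(sequence))
```

With a random segment order and a causal mask, whichever segment lands last is predicted from everything before it, so the same layout can in principle cover generating one view from a report plus other views, or a report from images.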
