Diff-CXR: Report-to-CXR generation through a disease-knowledge enhanced diffusion model

Peng Huang, Bowen Guo, Shuyu Liang, Junhu Fu, Yuanyuan Wang, Yi Guo

TL;DR

Diff-CXR is a disease-knowledge enhanced, diffusion-based text-to-image (TTI) learning framework for medical report-to-CXR generation. It outperforms previous SOTA medical TTI methods, and models trained on 1% real data plus its synthetic data achieve an mAUC score competitive with models trained on all data, presenting promising clinical applications.

Abstract

Text-To-Image (TTI) generation is significant for controlled and diverse image generation with broad potential applications. Although current medical TTI methods have made some progress in report-to-Chest-Xray (CXR) generation, their generation performance may be limited due to the intrinsic characteristics of medical data. In this paper, we propose a novel disease-knowledge enhanced Diffusion-based TTI learning framework, named Diff-CXR, for medical report-to-CXR generation. First, to minimize the negative impacts of noisy data on generation, we devise a Latent Noise Filtering Strategy that gradually learns the general patterns of anomalies and removes them in the latent space. Then, an Adaptive Vision-Aware Textual Learning Strategy is designed to learn concise and important report embeddings in a domain-specific Vision-Language Model, providing textual guidance for Chest-Xray generation. Finally, by incorporating general disease knowledge into the pretrained TTI model via a delicate control adapter, a disease-knowledge enhanced diffusion model is introduced to achieve realistic and precise report-to-CXR generation. Experimentally, our Diff-CXR outperforms previous SOTA medical TTI methods by 33.4% / 8.0% and 23.8% / 56.4% in FID / mAUC score on MIMIC-CXR and IU-Xray, respectively, with the lowest computational complexity at 29.641 GFLOPs. Downstream experiments on three thorax disease classification benchmarks and one CXR-report generation benchmark demonstrate that Diff-CXR is effective in improving classical CXR analysis methods. Notably, models trained on the combination of 1% real data and synthetic data can achieve a competitive mAUC score compared to models trained on all data, presenting promising clinical applications.
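
As a rough illustration of filtering noisy samples in a latent space, the sketch below scores each latent vector by its reconstruction error against a low-dimensional manifold fitted on the most typical samples. It is a simplified stand-in, not the paper's Latent Noise Filtering Strategy: PCA replaces the learned manifold model, a median-distance rule replaces latent clustering, no discriminator is trained, and the synthetic latents, `reconstruction_errors`, and the threshold of 1.0 are all assumptions made for this example.

```python
import numpy as np

def reconstruction_errors(latents, n_components=2, typical_fraction=0.5):
    """Score each latent by how poorly a linear manifold, fitted on the
    most typical samples, reconstructs it (higher = more anomalous)."""
    latents = np.asarray(latents, dtype=float)
    # 1) Pick typical samples: those closest to the per-dimension median.
    center = np.median(latents, axis=0)
    dist = np.linalg.norm(latents - center, axis=1)
    n_typical = max(n_components + 1, int(typical_fraction * len(latents)))
    typical = latents[np.argsort(dist)[:n_typical]]
    # 2) Model the manifold of the typical data with PCA (via SVD).
    mean = typical.mean(axis=0)
    _, _, vt = np.linalg.svd(typical - mean, full_matrices=False)
    basis = vt[:n_components]
    # 3) Reconstruction error = squared distance to the fitted manifold.
    centered = latents - mean
    residual = centered - centered @ basis.T @ basis
    return (residual ** 2).sum(axis=1)

# Synthetic latents (assumption): clean samples lie near a 2-D subspace
# of a 16-D latent space, noisy samples are scattered in all directions.
rng = np.random.default_rng(0)
mixing = rng.standard_normal((2, 16))
clean = rng.standard_normal((200, 2)) @ mixing
clean += 0.01 * rng.standard_normal((200, 16))
noisy = 5.0 * rng.standard_normal((10, 16))
latents = np.vstack([clean, noisy])

errors = reconstruction_errors(latents)
keep = errors <= 1.0  # hypothetical threshold; True means "keep the sample"
```

In this toy setting the mask `keep` retains the clean latents and drops the scattered ones. In practice the threshold would have to be tuned on held-out data, or replaced by a trained discriminator as the Figure 3 caption describes.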

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Characteristics of medical data that are essential to the report-to-CXR generation process. (a) visualizes several noisy examples. (b) shows a long report example, which must be padded to the maximum textual token limit of 256 during VLM encoding. (c) illustrates that patients with different diseases may have the same image manifestations. Disease information and imaging characteristics are highlighted in red in the given reports.
  • Figure 2: Overview of Diff-CXR. Given the training set, the latent noise filtering strategy removes noisy images and their reports in the latent space in a coarse-to-fine manner. Then, on the pruned dataset, the adaptive vision-aware textual learning strategy prompts the domain-specific language model to learn visually relevant and concise textual guidance for image generation. Finally, the disease-knowledge enhanced diffusion model is trained in two stages. In the vanilla diffusion process, the diffusion model is conditioned only on textual embeddings. Disease-specific knowledge is then extracted and injected via a control adapter to gradually refine the generation results.
  • Figure 3: Diagram of the latent noise filtering strategy. Latent clustering identifies typical normal data across different clusters. Manifold modeling captures the underlying structure of these normal data through reconstruction, so that typical noisy data can be identified by their reconstruction error. Finally, explicit supervision combines the typical normal and noisy data to train a discriminator, yielding more robust and accurate detection.
  • Figure 4: Illustration of the Adaptive Vision-Aware Textual Learning Strategy (AVA-TLS). (a) presents a medical report example in which CXR-related tokens are highlighted. The report must also be padded with uninformative tokens to meet the default token limit during tokenization. (b) shows the structure of our information squeeze module.
  • Figure 5: Illustration of the control adapter. The original weights of the denoising model are fixed. The trainable control adapter, which consists of copies of the encoder block and middle block, takes the disease knowledge embedding as input and interfaces with the fixed denoising model through zero linear layers.
  • ...and 4 more figures
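
The control adapter described in the Figure 5 caption can be sketched in a few lines. The example below assumes that "zero linear layers" means linear layers whose weights start at zero, as in ControlNet-style adapters, so that the adapter contributes nothing at the start of training and the frozen model's behaviour is preserved. The single-block model, the layer sizes, and the names `FrozenBlock`, `ControlAdapter`, and `controlled_forward` are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, KNOW_DIM = 8, 4  # hypothetical feature and knowledge-embedding sizes

class FrozenBlock:
    """Stand-in for one fixed block of the denoising model."""
    def __init__(self, dim):
        self.weight = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, h):
        return np.tanh(h @ self.weight)

class ControlAdapter:
    """Trainable copy of the block, wrapped in zero-initialized linear layers."""
    def __init__(self, frozen, know_dim):
        dim = frozen.weight.shape[0]
        self.weight = frozen.weight.copy()        # trainable copy of the block
        self.zero_in = np.zeros((know_dim, dim))  # injects disease knowledge
        self.zero_out = np.zeros((dim, dim))      # gates the adapter output

    def __call__(self, h, knowledge):
        conditioned = h + knowledge @ self.zero_in
        return np.tanh(conditioned @ self.weight) @ self.zero_out

def controlled_forward(frozen, adapter, h, knowledge):
    # The frozen path is untouched; the adapter only adds a residual.
    return frozen(h) + adapter(h, knowledge)

frozen = FrozenBlock(DIM)
adapter = ControlAdapter(frozen, KNOW_DIM)
h = rng.standard_normal((2, DIM))
knowledge = rng.standard_normal((2, KNOW_DIM))

baseline = frozen(h)
at_init = controlled_forward(frozen, adapter, h, knowledge)
```

With the zero layers still at their initial values, `at_init` matches `baseline`, so training can start from the pretrained model's outputs and let the disease knowledge refine them gradually as the zero layers move away from zero.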