Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning

Woojung Han, Chanyoung Kim, Dayun Ju, Yumin Shim, Seong Jae Hwang

TL;DR

This work tackles the challenge of generating chest X-rays that faithfully reflect diagnostic reports by framing text-to-CXR synthesis as a reinforcement-learning problem over diffusion models. The authors introduce CXRL, which jointly tunes the image generator and learnable adaptive condition embeddings (ACE) under a Reinforcement Learning with Comparative Feedback (RLCF) scheme, guided by three domain-specific rewards: posture alignment, diagnostic condition accuracy, and multimodal consistency with the input report. By evaluating on the MIMIC-CXR-JPG dataset and conducting extensive ablations, the study demonstrates that ACE and RLCF yield superior CXR fidelity, pathology representation, and report alignment compared to baselines, achieving pathologically realistic images suitable for education, privacy-preserving data synthesis, and clinical research. The proposed adaptive reward framework and policy-gradient optimization provide a scalable pathway to improve medical image synthesis in challenging environments where absolute quality is hard to quantify, with potential broad impact on clinical decision support and medical training.

Abstract

Recent advances in text-conditioned image generation diffusion models have begun paving the way for new opportunities in the medical domain, in particular, generating chest X-rays (CXRs) from diagnostic reports. Nonetheless, driving diffusion models to generate CXRs that faithfully reflect the complexity and diversity of real data requires a nontrivial learning approach. In light of this, we propose CXRL, a framework motivated by the potential of reinforcement learning (RL). Specifically, we integrate a policy gradient RL approach with multiple well-designed, CXR-domain-specific reward models. This approach guides the diffusion denoising trajectory, achieving precise CXR posture and pathological details. Here, considering the complex medical image environment, we present "RL with Comparative Feedback" (RLCF) for the reward mechanism, a human-like comparative evaluation that is known to be more effective and reliable than direct evaluation in complex scenarios. Our CXRL framework jointly optimizes learnable adaptive condition embeddings (ACE) and the image generator, enabling the model to produce more accurate CXRs with higher perceptual quality. Our extensive evaluation on the MIMIC-CXR-JPG dataset demonstrates the effectiveness of our RL-based tuning approach. Consequently, CXRL generates pathologically realistic CXRs, establishing a new standard for generating CXRs with high fidelity to real-world clinical scenarios.
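
The abstract describes two ingredients: a comparative reward (a generated CXR is judged against a reference rather than scored in isolation) and a policy gradient update over the denoising trajectory. The sketch below is one possible reading of that idea, not the paper's formulation, which is not reproduced on this page. It assumes that each of the three reward models returns a scalar score, that the comparison is a simple weighted difference between the candidate and an anchor sample, and that the update is a plain REINFORCE-style surrogate. All function names, reward names, and weights are hypothetical.

```python
from typing import Dict, Sequence

# Hypothetical labels for the three reward models named in the summary.
REWARD_NAMES = ("posture", "diagnosis", "consistency")


def comparative_reward(
    candidate: Dict[str, float],
    anchor: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of score differences (candidate minus anchor).

    Assumption: "comparative feedback" is modeled as a difference of
    scalar scores; the paper may define the comparison differently.
    """
    return sum(weights[n] * (candidate[n] - anchor[n]) for n in REWARD_NAMES)


def policy_gradient_loss(step_log_probs: Sequence[float], reward: float) -> float:
    """REINFORCE-style surrogate loss for one denoising trajectory.

    step_log_probs holds the log-probability of each denoising step
    under the current policy. Minimizing -reward * sum(log_probs)
    raises the likelihood of trajectories with positive reward.
    """
    return -reward * sum(step_log_probs)


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    cand = {"posture": 0.8, "diagnosis": 0.7, "consistency": 0.6}
    anch = {"posture": 0.5, "diagnosis": 0.7, "consistency": 0.4}
    w = {name: 1.0 for name in REWARD_NAMES}
    r = comparative_reward(cand, anch, w)
    print(r, policy_gradient_loss([-1.0, -2.0], r))
```

In a real training loop the log-probabilities would come from the diffusion model's per-step transition distributions and the loss would be differentiated with respect to both the image generator and the ACE parameters; the scalar version here only shows how the sign and magnitude of the comparative reward shape the update.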

Paper Structure (14 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of CXRL. Our model employs policy gradient optimization with multi-reward feedback, fine-tuning the image generator and ACE to produce realistic and accurate CXRs that correspond closely to the input report.
  • Figure 1: Qualitative ablation on each reward model. (a): CXRL shows significantly better alignment of the clavicle and costophrenic angle than the anchor with respect to posture alignment. (b): CXRL demonstrates improved predictive diagnostic accuracy, closely matching the GT and enhancing clinical decision-making. (c): The multimodal consistency reward ensures that CXRs and reports correspond well, as indicated by arrows and text in matching colors. Conversely, the anchor sometimes generates content that is either irrelevant to the report or should not be present. Specifically, in the second row, the presence of edema is verifiable within the red circle, indicating misalignment between the CXR and the report.
  • Figure 2: A detailed illustration of our reward feedback models. We incorporate three different types of feedback so that the report-to-CXR generation model produces goal-oriented CXRs.
  • Figure 2: Additional qualitative results of our framework comparing against baselines. The colored texts match their corresponding colored arrows. Ours w/o ACE or RLCF demonstrates superior report agreement and posture alignment compared to other baselines, and CXRL is observed to generate more advanced high-fidelity CXRs that highlight our methodology's effectiveness in synthesizing clinically accurate medical images.
  • Figure 3: Comparison between previous state-of-the-art report-to-CXR generation models [lee2024llmcxr, chambon2022roentgen] and ours. The blue and green texts match their corresponding colored arrows.