Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records
Daeun Kyung, Junu Kim, Tackeun Kim, Edward Choi
TL;DR
This work predicts patient-specific temporal changes in chest X-ray (CXR) images by conditioning a latent diffusion model on both a prior CXR $I_{prev}$ and a sequence of electronic health record events $\mathcal{S}_{event}$. Formally, the task is to predict the target image $I_{trg}$ from $(I_{prev}, \mathcal{S}_{event})$ using a latent diffusion process whose conditioning is $\tau_{\phi}(\mathbf{y}) = \tau_{\phi}(I_{prev}, \mathcal{S}_{event})$. The EHRXDiff framework captures fine-grained visual structure with a VAE and high-level clinical context with image and tabular CLIP encoders, then fuses the modalities in a cross-attention-based multimodal adapter that aligns the VAE latents and CLIP embeddings to guide the denoising steps. Evaluated on MIMIC-IV and MIMIC-CXR-JPG, the approach preserves medical findings and demographic attributes while producing visually faithful future CXRs, and it outperforms baselines in modeling temporal changes, especially on the Diff subset, where pathology evolves. The results suggest potential clinical utility for monitoring disease progression and informing treatment planning; future work includes long-term forecasting and additional data modalities.
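To make the fusion step concrete, the following is a minimal sketch of single-head cross-attention in plain Python, in which noisy latent tokens act as queries over a context built from the prior image and the event sequence. It is an illustration under stated assumptions, not the EHRXDiff implementation: the function names, the toy 2-dimensional vectors, and the use of identity projections in place of learned query, key, and value weights are all hypothetical simplifications.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    # Single-head scaled dot-product cross-attention.
    # Assumption: identity projections stand in for the learned
    # W_q, W_k, W_v matrices a real model would use.
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted sum of context vectors (values == keys here).
        out = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy inputs (not real embeddings):
noisy_latent_tokens = [[1.0, 0.0], [0.0, 1.0]]  # queries from the denoiser
image_tokens = [[1.0, 0.0]]                     # stand-in for features of I_prev
event_tokens = [[0.0, 1.0], [0.5, 0.5]]         # stand-in for S_event embeddings

# The conditioning context concatenates both modalities.
context = image_tokens + event_tokens
fused, weights = cross_attention(noisy_latent_tokens, context)
```

Each row of `weights` shows how strongly one latent token attends to the image token versus the event tokens. In a real latent diffusion model this operation would sit inside the denoising network and be applied at every denoising step.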
Abstract
Chest X-ray (CXR) is an important diagnostic tool widely used in hospitals to assess patient conditions and monitor changes over time. Recently, generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic CXRs. However, these models mainly focus on conditional generation using single-time-point data, i.e., generating CXRs conditioned on their corresponding reports from a specific time. This limits their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions and lab measurements. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects: clinical consistency, demographic consistency, and visual realism. Results show that our framework generates high-quality, realistic future images that effectively capture potential temporal changes. This suggests that our framework could be further developed to support clinical decision-making and provide valuable insights for patient monitoring and treatment planning. The code is available at https://github.com/dek924/EHRXDiff.
