Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Daeun Kyung; Junu Kim; Tackeun Kim; Edward Choi

TL;DR

This work predicts patient-specific temporal changes in chest X-ray (CXR) images by conditioning a latent diffusion model on both a prior CXR $I_{prev}$ and a sequence of electronic health record events $\mathcal{S}_{event}$. The EHRXDiff framework captures fine-grained visual structure with a VAE encoder and high-level clinical context with image and tabular CLIP encoders, then fuses the modalities through a cross-attention-based multimodal adapter. Evaluated on MIMIC-IV and MIMIC-CXR-JPG data, the approach preserves medical findings and demographic attributes while producing visually realistic future CXRs, and it outperforms baselines at modeling temporal changes, especially on the Diff subset where pathology evolves. The results suggest potential clinical utility for monitoring disease progression and informing treatment planning; future work includes long-term forecasting and additional data modalities. Formally, the task is to predict the target image $I_{trg}$ from $(I_{prev}, \mathcal{S}_{event})$ using a latent diffusion process with conditioning $\tau_{\phi}(\mathbf{y}) = \tau_{\phi}(I_{prev}, \mathcal{S}_{event})$, where the fused VAE latents and CLIP embeddings guide the denoising steps.
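
As a rough sketch of what this conditioning means in practice, the training objective can be written in the standard conditional latent diffusion form. This summary does not state the exact loss, so the expression below is an assumption based on the usual noise-prediction formulation; the symbols $\epsilon_{\theta}$ (the denoising U-Net), $\bar{\alpha}_t$ (the cumulative noise schedule), and $z_0$ (the clean latent of the target image) are introduced here for illustration only.

$$z_0 = E_{\mathrm{VAE}}(I_{trg}), \qquad z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t} \left[ \left\| \epsilon - \epsilon_{\theta}\big(z_t,\, t,\, \tau_{\phi}(I_{prev}, \mathcal{S}_{event})\big) \right\|_2^2 \right]$$

Under this reading, $\tau_{\phi}$ is the only path through which the prior image and the event sequence influence the prediction of $I_{trg}$.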

Abstract

Chest X-ray (CXR) is an important diagnostic tool widely used in hospitals to assess patient conditions and monitor changes over time. Recently, generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic CXRs. However, these models mainly focus on conditional generation using single-time-point data, i.e., generating CXRs conditioned on their corresponding reports from a specific time. This limits their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions and lab measures. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects: clinical consistency, demographic consistency, and visual realism. Results show that our framework generates high-quality, realistic future images that effectively capture potential temporal changes. This suggests that our framework could be further developed to support clinical decision-making and provide valuable insights for patient monitoring and treatment planning in the medical field. The code is available at https://github.com/dek924/EHRXDiff.

Paper Structure (59 sections, 3 equations, 6 figures, 14 tables)

Introduction
Related Works
Generative Models for CXR Imaging
Longitudinal CXR Imaging
Multimodal Fusion for Imaging and EHR Tabular Events in Clinical Applications
Methodology
Task Definition
Background: Conditional Latent Diffusion Model
Model Architecture
Encoder Modules
VAE Encoder
CLIP Encoders
Adapter for Multimodal Fusion
Data Augmentation
Experiments
...and 44 more sections

Figures (6)

Figure 1: Task overview. The EHR contains a patient's medical history recorded during the hospital stay, including structured data (e.g., charts, medications, microbiology events) and unstructured records (e.g., CXR). Our task is to predict the subsequent CXR image based on the prior CXR image and the medical events recorded after its acquisition.
Figure 2: Overall framework. During training, a random timestep $t$ is sampled, and the latent vector $z$ is corrupted to $z_t$ via the forward diffusion process. For inference, Gaussian noise is sampled and iteratively denoised over $T$ steps. In both cases, image embeddings from the CLIP and VAE encoders ($E_{CLIP}^{img}$, $E_{\mathrm{VAE}}$) and table embeddings from the CLIP table encoder ($E_{CLIP}^{tab}$) are fused by an adapter module to condition the denoising U-Net.
Figure 3: Detail of adapter module $A_{fusion}$. This module merges the embeddings from the VAE and CLIP encoders. To simplify the process, we project the VAE embeddings to match the dimensionality of the CLIP features $D_{CLIP}$ before fusing.
Figure 4: Qualitative results. $I_{prev}$ and $I_{trg}$ are real CXR images from two different timestamps, while $I_{pred}$ is predicted by $\text{EHRXDiff}_{w\_null}$. $\mathcal{S}_{event}$ (below the arrows) represents the medical events between the two timepoints, with bold text indicating descriptions of the differences between the CXRs.
Figure 5: Qualitative results. $I_{prev}$ and $I_{trg}$ are real CXR images from two different timestamps, while $I_{pred}$ is predicted by $\text{EHRXDiff}_{w\_null}$. $\mathcal{S}_{event}$ (shown below the arrows) represents the medical events between the two timepoints, with bold text indicating descriptions of the differences between the GT CXRs.
...and 1 more figures
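
The training-time corruption step described in the Figure 2 caption (sample a random timestep $t$, then corrupt $z$ to $z_t$) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the linear beta schedule, its endpoint values, the number of steps, and the helper names `linear_alpha_bars` and `corrupt_latent` are all assumptions chosen for illustration, and the four-element list merely stands in for a flattened VAE latent.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are illustrative assumptions, not values
    taken from the paper.
    """
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def corrupt_latent(z0, t, alpha_bars, rng):
    """Forward diffusion: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [
        math.sqrt(abar) * z + math.sqrt(1.0 - abar) * e
        for z, e in zip(z0, eps)
    ]
    return z_t, eps


rng = random.Random(0)
alpha_bars = linear_alpha_bars(num_steps=1000)
z0 = [0.5, -1.0, 2.0, 0.0]          # toy stand-in for a flattened VAE latent
t = rng.randrange(len(alpha_bars))  # random training timestep
z_t, eps = corrupt_latent(z0, t, alpha_bars, rng)
```

In a full training loop, a denoising network would receive `z_t`, `t`, and the fused conditioning from the adapter, and would be trained to predict `eps`. At inference the process runs in reverse, starting from pure Gaussian noise.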
