Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning

Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub

TL;DR

Healthcare AI must be predictive, reliable, and data-efficient, but large generative models often lack physical grounding and the capacity for temporal decision-making. The paper surveys world models that learn predictive dynamics and latent representations to enable internal simulation, counterfactual evaluation, and planning across imaging, EHR, and robotic surgery. These models build on backbones such as transformers, diffusion models, and VAEs, and are trained either to model transitions such as $p(s_{t+1} \mid s_t, a_t)$ or with JEPA-style objectives that predict future latents. It introduces a four-level capability rubric (L1-L4), synthesizes current work, and identifies gaps such as underspecified action spaces and weak interventional validation, outlining a research agenda to integrate causal/mechanical priors with generative backbones. The goal is to shift from generation to prediction-first world models that support safe, decision-focused clinical planning, with standardized evaluation, tooling, and governance to enable closed-loop decision support in healthcare.

Abstract

Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack the physical grounding and temporal reasoning required for clinical decision support. As scaling language models shows diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews world models for healthcare: systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture (JEPA)-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1-L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability: under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust, prediction-first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundations for safe decision support in healthcare.
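
To make the four rubric levels concrete, the following is a minimal sketch in plain Python, not a method from any reviewed paper. The one-dimensional "severity" state, the `step` dynamics, the cost weights, and all function names are hypothetical choices made for illustration only. L1 is approximated here as a rollout with no intervention, L2 as a rollout under a given action sequence, L3 as a comparison of two rollouts from the same starting state, and L4 as a search over candidate action sequences.

```python
from typing import List, Sequence

State = List[float]
Action = float


def step(state: State, action: Action) -> State:
    """Hypothetical toy transition s_{t+1} = f(s_t, a_t).

    Severity grows 10% per step and is reduced in proportion to the dose.
    """
    severity = state[0]
    return [max(0.0, 1.1 * severity - 0.5 * action)]


def rollout(state: State, actions: Sequence[Action]) -> List[State]:
    """Multistep rollout: apply the transition once per action (L1/L2)."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


def cost(trajectory: List[State], actions: Sequence[Action]) -> float:
    """Assumed objective: final severity plus a penalty on total dose."""
    return trajectory[-1][0] + 0.2 * sum(actions)


def plan(state: State, candidates: Sequence[Sequence[Action]]) -> Sequence[Action]:
    """L4 sketch: pick the candidate action sequence with the lowest cost."""
    return min(candidates, key=lambda acts: cost(rollout(state, acts), acts))


s0 = [1.0]
no_treatment = [0.0, 0.0, 0.0]
full_treatment = [1.0, 1.0, 1.0]

# L1: temporal prediction with no intervention.
untreated = rollout(s0, no_treatment)
# L2: action-conditioned prediction under a specific treatment schedule.
treated = rollout(s0, full_treatment)
# L3: counterfactual comparison of outcomes from the same starting state.
effect = untreated[-1][0] - treated[-1][0]
# L4: planning over a small, hand-written candidate set.
best = plan(s0, [no_treatment, [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], full_treatment])
print(best, effect)
```

In this toy setting the planner prefers the two-dose schedule over the three-dose one because the dose penalty outweighs the small residual severity. A real clinical system would replace the exhaustive candidate list with a constrained search and would need calibrated uncertainty over trajectories, which is one of the gaps the review names.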

Paper Structure

This paper contains 5 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Conceptual schematic of world models for healthcare. Multimodal clinical inputs are encoded into a latent state; a latent dynamics predictor models transitions $p(s_{t+1}\mid s_t, a_t)$; predicted futures support downstream tasks across imaging, EHR trajectories, drug discovery, surgical robotics, and digital twins. The JEPA-style objective (bottom) trains a predictor $f_{\theta}$ to map context embeddings $E_{\mathrm{ctx}}(x_{\mathrm{ctx}})$ and latent dynamics $z$ to match target embeddings $E_{\mathrm{tgt}}(x_{\mathrm{tgt}})$.
  • Figure 2: Capability map of reviewed papers across four levels: L1 (temporal prediction), L2 (action-conditioned prediction), L3 (counterfactual rollouts for decision support), L4 (planning/control). Colors denote domains (Imaging, EHR, Robotics). Dashed borders indicate methods spanning adjacent levels (e.g., TaDiff and Surgical Vision WM at L2-L3).
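
The JEPA-style objective described in the Figure 1 caption can be sketched as a regression in embedding space: a predictor $f_{\theta}$ takes the context embedding $E_{\mathrm{ctx}}(x_{\mathrm{ctx}})$ together with $z$ and is penalized for its distance to the target embedding $E_{\mathrm{tgt}}(x_{\mathrm{tgt}})$. The snippet below is a minimal illustration under stated assumptions: the encoders and predictor are plain linear maps, the loss is mean squared error, and the names `linear` and `jepa_loss` are hypothetical. The survey does not prescribe these choices; in practice the target encoder is often held fixed or updated slowly relative to the context encoder, which this sketch does not model.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a linear map (one output per weight row) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def jepa_loss(
    x_ctx: Vector,
    x_tgt: Vector,
    z: Vector,
    w_ctx: Matrix,
    w_tgt: Matrix,
    w_pred: Matrix,
) -> float:
    """Mean squared error between predicted and target embeddings."""
    e_ctx = linear(w_ctx, x_ctx)      # E_ctx(x_ctx)
    e_tgt = linear(w_tgt, x_tgt)      # E_tgt(x_tgt), treated as a fixed target
    pred = linear(w_pred, e_ctx + z)  # f_theta applied to [E_ctx(x_ctx); z]
    return sum((p - t) ** 2 for p, t in zip(pred, e_tgt)) / len(e_tgt)


# Toy weights (assumptions): identity encoders, predictor adds z to the first coordinate.
identity = [[1.0, 0.0], [0.0, 1.0]]
w_pred = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

x_ctx = [1.0, 2.0]  # observed context, e.g. features of an earlier scan
x_tgt = [1.5, 2.0]  # target to be predicted, e.g. features of a later scan

loss_informed = jepa_loss(x_ctx, x_tgt, [0.5], identity, identity, w_pred)
loss_uninformed = jepa_loss(x_ctx, x_tgt, [0.0], identity, identity, w_pred)
print(loss_informed, loss_uninformed)
```

The loss is computed between embeddings rather than raw inputs, so the model is not asked to reconstruct every pixel or record field. In the toy example, supplying the right $z$ drives the loss to zero, while an uninformative $z$ leaves a residual error.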