Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub
TL;DR
Healthcare AI must be predictive, reliable, and data-efficient, but large generative models often lack physical grounding and support for temporal decision-making. The paper surveys world models, which learn predictive dynamics and latent representations to enable internal simulation, counterfactual evaluation, and planning across imaging, electronic health records (EHR), and robotic surgery. These systems build on backbones such as transformers, diffusion models, and VAEs, trained with transition objectives (e.g., $p(s_{t+1} \mid s_t, a_t)$) or JEPA-style future-latent prediction. The paper introduces a four-level capability rubric (L1-L4), synthesizes current work, and identifies gaps such as underspecified action spaces and weak interventional validation, outlining a research agenda that integrates causal/mechanical priors with generative backbones. The goal is to shift from generation to prediction-first world models that support safe, decision-focused clinical planning, with standardized evaluation, tooling, and governance to enable closed-loop decision support in healthcare.
Abstract
Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack the physical grounding and temporal reasoning required for clinical decision support. As scaling language models yields diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews world models for healthcare: systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation, and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture (JEPA)-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1-L2; fewer reach L3, and L4 is rare. We identify cross-cutting gaps that limit clinical reliability: under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust, prediction-first world models that integrate generative backbones (transformers, diffusion models, VAEs) with causal/mechanical foundations for safe decision support in healthcare.
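
To make the L1-L4 rubric concrete, the following is a minimal illustrative sketch, not a method from any surveyed system. It assumes a toy linear latent dynamics model $s_{t+1} = A s_t + B a_t$ with made-up matrices `A` and `B`; the function names (`step`, `rollout`, `counterfactual`, `plan`) and the cost function are hypothetical and chosen only to mirror the four capability levels.

```python
import numpy as np

# Hypothetical toy world model: s_{t+1} = A s_t + B a_t.
# A and B are invented for illustration, not learned from clinical data.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.5]])

def step(state, action):
    """L2: action-conditioned one-step prediction."""
    return A @ state + B @ action

def rollout(state, actions):
    """L1/L2: multistep rollout under a given action sequence."""
    states = [state]
    for a in actions:
        states.append(step(states[-1], a))
    return np.stack(states)

def counterfactual(state, actions_a, actions_b):
    """L3: predicted final states for two alternative action sequences
    applied from the same starting state."""
    return rollout(state, actions_a)[-1], rollout(state, actions_b)[-1]

def plan(state, candidates, cost):
    """L4: pick the candidate sequence whose predicted final state
    has the lowest cost (exhaustive search over a small candidate set)."""
    costs = [cost(rollout(state, c)[-1]) for c in candidates]
    return int(np.argmin(costs))

# Usage: compare "no intervention" with a constant hypothetical intervention.
s0 = np.array([1.0, 0.0])
no_action = [np.array([0.0])] * 3
intervene = [np.array([-1.0])] * 3
final_a, final_b = counterfactual(s0, no_action, intervene)
best = plan(s0, [no_action, intervene], cost=lambda s: s[0])
```

In this toy setting the first state component stands in for an outcome to be minimized. A real clinical world model would replace the linear map with a learned, uncertainty-aware transition model and would constrain the candidate actions to those that are clinically safe.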
