
PRIME-CVD: A Parametrically Rendered Informatics Medical Environment for Education in Cardiovascular Risk Modelling

Nicholas I-Hsien Kuo, Marzia Hoque Tania, Blanca Gallego, Louisa Jorm

Abstract

In recent years, progress in medical informatics and machine learning has been accelerated by the availability of openly accessible benchmark datasets. However, patient-level electronic medical record (EMR) data are rarely available for teaching or methodological development due to privacy, governance, and re-identification risks. This has limited reproducibility, transparency, and hands-on training in cardiovascular risk modelling. Here we introduce PRIME-CVD, a parametrically rendered informatics medical environment designed explicitly for medical education. PRIME-CVD comprises two openly accessible synthetic data assets representing a cohort of 50,000 adults undergoing primary prevention for cardiovascular disease. The datasets are generated entirely from a user-specified causal directed acyclic graph parameterised using publicly available Australian population statistics and published epidemiologic effect estimates, rather than from patient-level EMR data or trained generative models. Data Asset 1 provides a clean, analysis-ready cohort suitable for exploratory analysis, stratification, and survival modelling, while Data Asset 2 restructures the same cohort into a relational, EMR-style database with realistic structural and lexical heterogeneity. Together, these assets enable instruction in data cleaning, harmonisation, causal reasoning, and policy-relevant risk modelling without exposing sensitive information. Because all individuals and events are generated de novo, PRIME-CVD preserves realistic subgroup imbalance and risk gradients while ensuring negligible disclosure risk. PRIME-CVD is released under a Creative Commons Attribution 4.0 licence to support reproducible research and scalable medical education.
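The idea of generating a cohort from a causal directed acyclic graph, rather than from patient records, can be illustrated with a minimal sketch. The graph below (age and smoking feeding systolic blood pressure, all three feeding a time-to-event outcome) and every numeric parameter in it are hypothetical placeholders chosen for illustration; they are not the PRIME-CVD graph or its published parameter values. The function name `simulate_cohort` and the record field names are likewise assumptions.

```python
import math
import random


def simulate_cohort(n, seed=0, horizon_years=5.0):
    """Sample n synthetic individuals by ancestral sampling from a toy DAG.

    All coefficients are illustrative placeholders, NOT values from PRIME-CVD.
    """
    rng = random.Random(seed)
    cohort = []
    for person_id in range(n):
        # Root nodes: no parents, drawn from marginal distributions.
        age = rng.uniform(30.0, 75.0)
        smoker = rng.random() < 0.15

        # Child node: systolic blood pressure depends on its parents plus noise.
        sbp = 110.0 + 0.4 * (age - 30.0) + (4.0 if smoker else 0.0) + rng.gauss(0.0, 10.0)

        # Outcome node: exponential event time with a log-linear hazard.
        log_hazard_ratio = 0.05 * (age - 50.0) + 0.5 * smoker + 0.01 * (sbp - 120.0)
        hazard = 0.005 * math.exp(log_hazard_ratio)
        event_time = rng.expovariate(hazard)

        # Administrative censoring at the end of follow-up.
        event = event_time < horizon_years
        cohort.append({
            "person_id": person_id,
            "age": age,
            "smoker": smoker,
            "sbp": sbp,
            "followup_years": min(event_time, horizon_years),
            "event": event,
        })
    return cohort


if __name__ == "__main__":
    cohort = simulate_cohort(1000, seed=42)
    n_events = sum(r["event"] for r in cohort)
    print(f"{n_events} events among {len(cohort)} simulated adults")
```

Because each variable is drawn only after its parents, the sampled records respect the dependency structure of the graph by construction, and no real individual contributes to any record. Fixing the seed makes the cohort reproducible.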

Paper Structure (39 sections, 15 equations, 12 figures, 16 tables)


Figures (12)

  • Figure 1: Baseline characteristics of the PRIME-CVD cohort used for cardiovascular risk simulation.
  • Figure 2: Directed acyclic graph representing the causal structure embedded in PRIME-CVD. Circular and rectangular nodes denote numeric and categorical variables, respectively; blue, orange, and red represent demographic/lifestyle determinants, chronic diseases, and anthropometric/physiological measurements.
  • Figure 3: Constructing the PRIME-CVD EMR-style Data Asset 2. The clean DAG-generated cohort is split into three relational tables and augmented with realistic EMR artefacts, including missingness, ID scrambling, heterogeneous terminology, and mixed units.
  • Figure 4: Socioeconomic distributions for mutually exclusive CKD and T2DM cohorts reconstructed from the relational Data Asset 2. Bar plots summarise the prevalence of IRSD quintiles within each cohort following linkage and harmonisation of diagnosis records, enabling comparison of socioeconomic profiles across conditions.
  • Figure 5: IRSD-stratified distributions of variables in Data Asset 1.
  • ...and 7 more figures