An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases

Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq

TL;DR

The paper tackles the need for early, explainable multi-disease risk surveillance using routine EHR data to forecast risks up to $12$ months ahead for eight chronic diseases. It develops Random Forest–based predictors for each disease and horizon ($3$, $6$, $12$ months) without relying on laboratory tests, augmented with SHAP explanations and a novel rule-engineering framework for interpretable rules. AUROC exceeds $0.80$ across diseases and horizons, and F1 scores are typically above $0.75$, with some decline for $12$-month hypertension and hyperlipidemia. The system is designed for practical EMR integration via a local API using common protocols like FHIR and CPU-based deployment, enabling wide, cost-effective adoption. This work supports proactive, value-based care and offers a path toward Real-World Evidence generation for health systems.
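One way to picture the per-horizon setup is as a date cutoff: a model for horizon h sees only records dated at least h months before the diagnosis. The sketch below illustrates that idea and is not the authors' pipeline. The `(record_date, code)` record layout, the function names, and the example dates are all hypothetical, and it does not address how an anchor date would be chosen for patients who are never diagnosed.

```python
from datetime import date

HORIZONS_MONTHS = (3, 6, 12)  # prediction horizons named in the text


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped to the last valid day of the target
    month, so 31 May minus 3 months gives 28 (or 29) February.
    """
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    if month == 12:
        next_month_start = date(year + 1, 1, 1)
    else:
        next_month_start = date(year, month + 1, 1)
    last_day = (next_month_start - date(year, month, 1)).days
    return date(year, month, min(day.day, last_day))


def records_visible_at_horizon(records, diagnosis_date, horizon_months):
    """Keep only records dated on or before the cutoff for one horizon.

    `records` is a list of (record_date, code) tuples (hypothetical layout).
    """
    cutoff = months_before(diagnosis_date, horizon_months)
    return [(d, code) for d, code in records if d <= cutoff]


# Made-up history for one patient, for illustration only.
history = [
    (date(2022, 1, 10), "I10"),
    (date(2022, 9, 5), "E78"),
    (date(2022, 12, 20), "J44"),
]
diagnosis = date(2023, 1, 15)

visible = {
    h: records_visible_at_horizon(history, diagnosis, h)
    for h in HORIZONS_MONTHS
}
```

Under this assumption the 3-month model would see the first two records, while the 6- and 12-month models would see only the first. Longer horizons therefore work from less history, which is consistent with the reported decline at 12 months for some diseases.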

Abstract

This study addresses a critical gap in the healthcare system by developing a clinically meaningful, practical, and explainable disease surveillance system for multiple chronic diseases, using routine EHR data from multiple U.S. practices integrated with CureMD's EMR/EHR system. Unlike traditional systems, whose AI models rely on features from patients' laboratory results, our approach focuses on routinely available data, such as medical history, vitals, diagnoses, and medications, to preemptively assess the risk of chronic diseases in the next year. We trained three distinct models for each chronic disease, forecasting the risk of the disease 3, 6, and 12 months before a potential diagnosis. The models are Random Forests, internally validated using F1 score and AUROC as performance metrics and further evaluated by a panel of expert physicians for clinical relevance, based on inferences grounded in medical knowledge. Additionally, we discuss how we integrated these models into a practical EMR system. Beyond using Shapley attributions and surrogate models for explainability, we also introduce a new rule-engineering framework to enhance the intrinsic explainability of Random Forests.
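For readers less familiar with the two validation metrics named above, here is a minimal pure-Python sketch of how F1 and AUROC are defined. It is not the authors' evaluation code. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, and a real evaluation would normally use a tested library implementation.

```python
def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This O(n_pos * n_neg) form is for clarity only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = diagnosed within the horizon, 0 = not (illustrative only).
labels = [1, 1, 0, 1, 0, 0]
risk_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in risk_scores]  # assumed threshold

toy_auroc = auroc(labels, risk_scores)
toy_f1 = f1_score(labels, predictions)
```

On this toy data the AUROC is 8/9 and the F1 score is 6/7. AUROC depends only on how the scores rank patients, whereas F1 depends on the chosen threshold, which is one reason to report both.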

Paper Structure

This paper contains 10 sections, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Data preprocessing and model development workflow. The pipelines are automated and take human input through three configuration steps.
  • Figure 2: Overall workflow of the prediction API for multiple chronic diseases.
  • Figure 3: Explanations for a CHD prediction for one patient, as presented to the user. For this patient, age and BMI reduce the risk, while diagnoses of hypertension (I10), dyslipidemia (E78), and COPD (J44) increase it.