Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care

Christel Sirocchi, Muhammad Suffian, Federico Sabbatini, Alessandro Bogliolo, Sara Montagna

TL;DR

The integrated ML model achieves performance comparable to that of a fully data-driven model while exhibiting superior accuracy relative to the clinical protocol, ensuring enhanced continuity of care.

Abstract

In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, Machine Learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using ML models integrating domain knowledge for improved accuracy and interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose metrics to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated on the Pima Indians Diabetes dataset by training two neural networks - one exclusively on data, and the other integrating a clinical protocol. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior accuracy relative to the clinical protocol, ensuring enhanced continuity of care. Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.
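Concern (a) above can be made concrete by cross-tabulating the correctness of the protocol against the correctness of the model on the same samples. The sketch below is only an illustration under stated assumptions: the function name `protocol_relative_counts` and the four category names are hypothetical, and the paper's own metrics may be defined differently.

```python
def protocol_relative_counts(y_true, y_protocol, y_model):
    """Cross-tabulate protocol correctness against model correctness.

    Hypothetical helper, not the paper's metric definition.
    All three arguments are equal-length sequences of class labels.
    """
    if not (len(y_true) == len(y_protocol) == len(y_model)):
        raise ValueError("all label sequences must have the same length")

    counts = {
        "both_correct": 0,       # protocol right, model right
        "introduced_errors": 0,  # protocol right, model wrong
        "corrected_errors": 0,   # protocol wrong, model right
        "both_wrong": 0,         # protocol wrong, model wrong
    }
    for truth, protocol, model in zip(y_true, y_protocol, y_model):
        protocol_ok = protocol == truth
        model_ok = model == truth
        if protocol_ok and model_ok:
            counts["both_correct"] += 1
        elif protocol_ok:
            counts["introduced_errors"] += 1
        elif model_ok:
            counts["corrected_errors"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


if __name__ == "__main__":
    # Toy labels: 1 = diabetes, 0 = healthy (made-up values for illustration).
    print(protocol_relative_counts(
        y_true=[1, 0, 1, 0, 1],
        y_protocol=[1, 0, 0, 0, 0],
        y_model=[1, 1, 1, 0, 0],
    ))
```

The `introduced_errors` count corresponds to the errors "that would not have occurred by applying the protocol"; a model can have higher overall accuracy than the protocol and still have a non-zero value here.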

Paper Structure

This paper contains 25 sections, 24 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Diagram illustrating the proposed approach for assessing explanation similarity between a knowledge base (KB) and the rule sets KB-ML$_X$ and DD-ML$_X$, derived from rule extraction from a data-driven model (DD-ML) and an integrated model (KB-ML), predicting diabetes (D) or healthy (H) outcomes for instances of the Pima Indians Diabetes dataset.
  • Figure 2: (a) Performance metrics for the integrated model (KB-ML) with parameter $\alpha$, ranging from 0 to 4, averaged over 100 iterations with 95% confidence intervals. For $\alpha = 0$, the model corresponds to the fully data-driven model (DD-ML). (b) Comparison of true labels, outcomes of the clinical protocol (KB), and predictions of the two models, averaged over the 10 folds of the cross-validation.
  • Figure 3: (a) Average accuracy (A) and F1-score (F1) with 95% confidence intervals of the rule sets DD-ML$_X$ and KB-ML$_X$, extracted using CART from the data-driven models (DD-ML) and the integrated models (KB-ML), respectively, with the number of extracted rules varying from 2 to 12. (b) Explanation similarity metrics (leveraging XNOR, Jaccard, Cosine, and Dice similarities) computed between the protocol and either DD-ML$_X$ or KB-ML$_X$ across 100 iterations, on all samples that can be predicted by all rule sets. (*) and (**) above the bar plots indicate significant differences between the values for the corresponding metric in DD-ML$_X$ and KB-ML$_X$ at significance levels of 0.05 and 0.01, respectively. (c) Explanation similarity metrics for robustness evaluation, leveraging XNOR similarity, computed across the 100 instances of DD-ML$_X$ and, separately, across the 100 instances of KB-ML$_X$.
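
The four similarity measures named in the Figure 3 caption have standard definitions on binary vectors. The sketch below assumes, purely for illustration, that an explanation is encoded as a 0/1 vector marking which features a rule uses; this encoding, the function names, and the handling of empty vectors are assumptions and are not taken from the paper.

```python
import math


def _counts(a, b):
    """Return (intersection, size of a, size of b, matches) for 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a = [bool(x) for x in a]
    b = [bool(x) for x in b]
    inter = sum(x and y for x, y in zip(a, b))
    matches = sum(x == y for x, y in zip(a, b))
    return inter, sum(a), sum(b), matches


def xnor_similarity(a, b):
    # Fraction of positions where both vectors agree (both 0 or both 1).
    _, _, _, matches = _counts(a, b)
    return matches / len(a) if len(a) else 1.0


def jaccard_similarity(a, b):
    # |A and B| / |A or B|; two empty sets are treated as identical.
    inter, na, nb, _ = _counts(a, b)
    union = na + nb - inter
    return inter / union if union else 1.0


def dice_similarity(a, b):
    # 2 |A and B| / (|A| + |B|); two empty sets are treated as identical.
    inter, na, nb, _ = _counts(a, b)
    return 2 * inter / (na + nb) if (na + nb) else 1.0


def cosine_similarity(a, b):
    # |A and B| / sqrt(|A| |B|) for binary vectors.
    inter, na, nb, _ = _counts(a, b)
    if na == 0 and nb == 0:
        return 1.0  # convention chosen here: two empty explanations match
    if na == 0 or nb == 0:
        return 0.0  # undefined in general; 0.0 is a convention chosen here
    return inter / math.sqrt(na * nb)


if __name__ == "__main__":
    # Hypothetical feature-usage vectors for a protocol rule and an extracted rule.
    protocol_rule = [1, 1, 0, 0]
    extracted_rule = [1, 0, 1, 0]
    for fn in (xnor_similarity, jaccard_similarity,
               dice_similarity, cosine_similarity):
        print(fn.__name__, fn(protocol_rule, extracted_rule))
```

Unlike the other three, XNOR similarity also rewards agreement on features that neither rule uses, so it tends to be higher when explanations are sparse.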