
Can Machine Learning Assist in Diagnosis of Primary Immune Thrombocytopenia? A feasibility study

Haroon Miah, Dimitrios Kollias, Giacinto Luca Pedone, Drew Provan, Frederick Chen

TL;DR

Primary Immune Thrombocytopenia (ITP) has no definitive diagnostic biomarker, which motivates exploring whether machine learning can diagnose ITP from routine outpatient blood tests and demographics. The study benchmarks five classical classifiers (Logistic Regression, Support Vector Machine, k-Nearest Neighbor, Decision Tree and Random Forest) under demographic-aware and demographic-unaware input schemes, using data from the UK Adult ITP Registry and a general hematology clinic. Macro F1 and Equalized Odds serve as evaluation metrics, and permutation importance is used for interpretability. Random Forest and Decision Tree achieve near-perfect predictive performance and high fairness, with platelet count (dx_plt_ct) identified as the most influential predictor. Demographic-aware models improve fairness but often reduce accuracy, illustrating a performance–fairness trade-off. The work suggests potential for earlier outpatient ITP screening and streamlined referrals, while acknowledging that future work needs larger datasets and a broader set of features.

Abstract

Primary immune thrombocytopenia (ITP) is a rare autoimmune disease characterised by immune-mediated destruction of peripheral blood platelets, leading to low platelet counts and bleeding. The diagnosis and effective management of ITP are challenging because there is no established test to confirm the disease and no biomarker with which one can predict the response to treatment and the outcome. In this work we conduct a feasibility study to assess whether machine learning can be applied effectively to the diagnosis of ITP using routine blood tests and demographic data in a non-acute outpatient setting. Various ML models, including Logistic Regression, Support Vector Machine, k-Nearest Neighbor, Decision Tree and Random Forest, were applied to data from the UK Adult ITP Registry and a general hematology clinic. Two approaches were investigated: a demographic-unaware one and a demographic-aware one. We conduct extensive experiments to evaluate the predictive performance of these models and approaches, as well as their bias. The results show that the Decision Tree and Random Forest models were both superior and fair, achieving nearly perfect predictive and fairness scores, with platelet count identified as the most significant variable. Models not provided with demographic information performed better in terms of predictive accuracy but showed a lower fairness score, illustrating a trade-off between predictive performance and fairness.
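To make the two kinds of evaluation concrete (predictive performance and bias), the following is a minimal sketch of macro F1 and an equalized-odds gap in plain Python. It is not the authors' implementation: the function names, the toy labels and the gender grouping are all illustrative assumptions, and the gap is computed with one common formulation (the largest between-group difference in true-positive or false-positive rate, where 0 means equal rates). The paper may define its fairness score differently.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def group_rates(y_true, y_pred, groups):
    """Per-group (TPR, FPR) for binary labels, with 1 = ITP (assumed coding)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            counts[g]["tp" if p == 1 else "fn"] += 1
        else:
            counts[g]["fp" if p == 1 else "tn"] += 1
    rates = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        rates[g] = (c["tp"] / pos if pos else 0.0,
                    c["fp"] / neg if neg else 0.0)
    return rates


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group gap in TPR or FPR (0.0 = rates are equal)."""
    rates = group_rates(y_true, y_pred, groups)
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy labels, invented for illustration only (not registry data).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

print(round(macro_f1(y_true, y_pred), 4))          # 0.873
print(equalized_odds_gap(y_true, y_pred, groups))  # 0.5
```

In this toy example the single missed ITP case falls in one group, so the classifier scores well on macro F1 while still showing a true-positive-rate gap between groups. This is the kind of divergence between predictive performance and fairness that the study examines.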

Paper Structure (13 sections, 2 equations, 17 figures, 3 tables)

Figures (17)

  • Figure S1: Boxplots of the variables for the ITP patients, namely Diagnosis Year, Age, Blood ALT Level, Blood Haemoglobin Level, Blood Neutrophil Level, White Blood Cell Count, Red Blood Cell Count and Blood Platelet Count.
  • Figure S2: Boxplots of the variables for the non-ITP patients, namely Diagnosis Year, Age, Blood ALT Level, Blood Haemoglobin Level, Blood Neutrophil Level, White Blood Cell Count, Red Blood Cell Count and Blood Platelet Count.
  • Figure S3: The gender distributions in the case of ITP (right side) and non-ITP (left side) patients.
  • Figure S4: Permutation feature importance on the training set for the RF model under the demographic-aware (left) and demographic-unaware (right) approaches.
  • Figure S5: Permutation feature importance on the test set for the RF model under the demographic-aware (left) and demographic-unaware (right) approaches.
  • ...and 12 more figures
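
Figures S4 and S5 report permutation feature importance. As a rough illustration of the idea (not the authors' code), the sketch below shuffles one feature column at a time and records the average drop in accuracy. Everything in it is an assumption made for the example: the hand-written threshold rule stands in for a trained model, the cut-off of 100 is illustrative, and the (platelet count, age) rows are synthetic.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features,
                           n_repeats=20, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and label
            shuffled = [r[:j] + (v,) + r[j + 1:]
                        for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def threshold_model(row):
    """Hypothetical stand-in classifier: flag ITP (1) on low platelet count."""
    platelet_count, _age = row
    return 1 if platelet_count < 100 else 0


# Synthetic (platelet count, age) rows, invented for illustration only.
rows = [(30, 45), (55, 62), (80, 38), (20, 71), (65, 29),
        (210, 50), (250, 33), (180, 66), (300, 41), (160, 58)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

importances = permutation_importance(threshold_model, rows, labels,
                                     n_features=2)
print(dict(zip(["platelet_count", "age"], importances)))
```

Because the stand-in rule ignores age, shuffling the age column leaves accuracy unchanged and its importance is exactly zero, while shuffling platelet count degrades the predictions. A feature that dominates the importance plot, as platelet count does in the study, behaves like the first column here.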