Medical records condensation: a roadmap towards healthcare data democratisation

Yujiang Wang, Anshul Thakur, Mingzhi Dong, Pingchuan Ma, Stavros Petridis, Li Shang, Tingting Zhu, David A. Clifton

TL;DR

The paper tackles the barriers to healthcare data democratisation posed by sensitive patient information and interoperability costs. It introduces dataset condensation (DC) as a method for generating condensed datasets that contain no individual-level records while preserving deep learning performance. Condensation is achieved by matching the embeddings of original and condensed data via Maximum Mean Discrepancy (MMD) under randomly initialised networks. Across PhysioNet-2012, MIMIC-III, and Coswara, DC yields substantial data compression and accelerated training with only modest drops in AUC for multiple architectures, illustrating its potential to accelerate and broaden AI research in healthcare. The authors discuss the no-free-lunch nature of de-identification, compare DC with other approaches, and highlight its promise for enabling open, AI-oriented knowledge sharing in healthcare, while acknowledging limitations and directions for future work.

Abstract

The prevalence of artificial intelligence (AI) has envisioned an era of healthcare democratisation that promises every stakeholder a new and better way of life. However, the advancement of clinical AI research is significantly hindered by the dearth of data democratisation in healthcare. To truly democratise data for AI studies, the challenges are two-fold: 1. the sensitive information in clinical data should be anonymised appropriately, and 2. AI-oriented clinical knowledge should flow freely across organisations. This paper considers a recent deep-learning advance, dataset condensation (DC), as a stone that kills two birds in democratising healthcare data. The condensed data after DC, which can be viewed as statistical metadata, abstracts the original clinical records and irreversibly conceals sensitive information at the individual level; nevertheless, it still preserves adequate knowledge for training deep neural networks (DNNs). More favourably, the compressed volume of condensed data and the accelerated model training it enables point to a more efficient system for sharing clinical knowledge, as necessitated by data democratisation. We underline DC's prospects for democratising clinical data, specifically electronic healthcare records (EHRs), for AI research through experimental results and analysis across three healthcare datasets of varying data types.
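
The embedding-matching idea mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-layer ReLU embedder, the noise initialisation of the condensed samples, the plain gradient-descent update, and all names and hyperparameters (`random_embedder`, `mmd_loss_and_grad`, `condense`, `steps`, `lr`, `out_dim`) are assumptions made for illustration. The loss is the squared distance between the mean embeddings of the original and condensed samples, which is the MMD under a linear kernel on the embedding space, and a fresh randomly initialised network is drawn at every step. The sketch handles a single class; a labelled dataset would presumably be condensed class by class.

```python
import numpy as np

def random_embedder(in_dim, out_dim, rng):
    """Weights of a randomly initialised one-layer ReLU network (hypothetical embedder)."""
    return rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))

def embed(X, W):
    """Embed samples with the random network: ReLU(X W)."""
    return np.maximum(X @ W, 0.0)

def mmd_loss_and_grad(X_real, X_syn, W):
    """Squared distance between mean embeddings, and its gradient w.r.t. X_syn."""
    diff = embed(X_syn, W).mean(axis=0) - embed(X_real, W).mean(axis=0)
    loss = float(diff @ diff)
    # Chain rule through the mean and the ReLU:
    # d loss / d X_syn[i] = (2 / n) * (1[X_syn[i] W > 0] * diff) W^T
    active = (X_syn @ W > 0).astype(X_syn.dtype)
    grad = (2.0 / X_syn.shape[0]) * (active * diff) @ W.T
    return loss, grad

def condense(X_real, n_syn, steps=300, lr=1.0, out_dim=64, seed=0):
    """Learn n_syn synthetic samples whose mean embedding matches that of X_real."""
    rng = np.random.default_rng(seed)
    # Assumption: start from Gaussian noise rather than from copies of real records.
    X_syn = rng.normal(size=(n_syn, X_real.shape[1]))
    for _ in range(steps):
        W = random_embedder(X_real.shape[1], out_dim, rng)  # new random network each step
        _, grad = mmd_loss_and_grad(X_real, X_syn, W)
        X_syn -= lr * grad  # plain gradient descent on the synthetic samples
    return X_syn

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_real = rng.normal(loc=2.0, size=(500, 10))  # toy stand-in for one class of records
    X_syn = condense(X_real, n_syn=20)
    print(X_syn.shape)
```

In this toy setting, 500 original rows are replaced by 20 learned rows, none of which is a copy of an original record. Whether such samples conceal individual-level information in practice is the question the paper examines; this sketch makes no claim about it.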

Paper Structure

This paper contains 4 sections, 5 equations, 7 figures, 2 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Visualisations of PhysioNet-2012 results. a, The average AUCs across 11 DNNs for the original train set and three scales of condensed data. The number of samples is appended to the name of each data group. b, The violin plot of all DNNs' test AUCs across the four data groups. "Ori." and "Con." are abbreviations of "Original" and "Condensed", respectively. c, Comparisons of the average AUC and disk space usage across the data groups. Space usage is shown on a logarithmic scale.
  • Figure 2: Visualisations of MIMIC-III results. a, The average AUCs across 11 DNNs for the original and three condensed sets. b, The violin plot of all DNNs' test AUCs across the four groups. c, Summaries of AUCs and data sizes across the data groups. Space usage is shown on a logarithmic scale.
  • Figure 3: Visualisations of Coswara results. a, The average AUCs across 11 DNNs for the original and three condensed sets. b, The violin plot of all DNNs' test AUCs across the four groups. c, Summaries of AUCs and data sizes across the data groups. Space usage is shown on a logarithmic scale.
  • Figure 4: a, The learning curves of TCN- trained on the original and condensed data of PhysioNet-2012. b, The learning curves of TRSF- trained on the original and condensed data of MIMIC-III. The red-dotted line indicates the step of convergence.
  • Figure 5: Visualisations of original and condensed samples. a, The heatmaps of three original and three condensed samples randomly drawn from MIMIC-III. b, The 48-hour trends of six clinical variables from PhysioNet-2012, computed from 80 original and 80 condensed instances, respectively.
  • ...and 2 more figures