Medical records condensation: a roadmap towards healthcare data democratisation
Yujiang Wang, Anshul Thakur, Mingzhi Dong, Pingchuan Ma, Stavros Petridis, Li Shang, Tingting Zhu, David A. Clifton
TL;DR
The paper tackles the barriers to healthcare data democratisation posed by sensitive patient information and interoperability costs. It proposes dataset condensation (DC) as a way to generate condensed datasets that contain no individual-level records while preserving deep learning performance. Condensation is achieved by matching embeddings via Maximum Mean Discrepancy (MMD) using randomly initialised networks. Across PhysioNet-2012, MIMIC-III, and Coswara, DC yields substantial data compression and accelerated training with only modest drops in AUC for multiple architectures, illustrating its potential to accelerate and broaden AI research in healthcare. The authors discuss the no-free-lunch nature of de-identification, compare DC with other approaches, and highlight its promise for enabling open, AI-oriented knowledge sharing in healthcare, while acknowledging limitations and directions for future work.
Abstract
The prevalence of artificial intelligence (AI) has brought into view an era of healthcare democratisation that promises every stakeholder a new and better way of life. However, the advancement of clinical AI research is significantly hindered by the lack of data democratisation in healthcare. Truly democratising data for AI studies poses two challenges: (1) the sensitive information in clinical data must be anonymised appropriately, and (2) AI-oriented clinical knowledge must flow freely across organisations. This paper considers a recent deep-learning development, dataset condensation (DC), as a way to address both challenges at once. The condensed data produced by DC, which can be viewed as statistical metadata, abstracts the original clinical records and irreversibly conceals sensitive information at the individual level, yet it still preserves adequate knowledge for training deep neural networks (DNNs). Moreover, the compressed data volume and the accelerated model training that condensed data affords make for a more efficient system of clinical knowledge sharing, as data democratisation requires. We highlight DC's prospects for democratising clinical data, specifically electronic health records (EHRs), for AI research through experimental results and analysis across three healthcare datasets of varying data types.
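The embedding-matching objective mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation. It assumes the simplest (linear-kernel) form of MMD, in which the loss is the squared distance between the mean embeddings of real and condensed records, averaged over several randomly initialised one-layer ReLU networks. The names random_embedder, mean_embedding, and mmd_loss, and all sizes, are hypothetical. The sketch only evaluates the loss; an actual condensation procedure would also update the synthetic records by gradient descent on it.

```python
import random


def random_embedder(in_dim, out_dim, seed):
    """Build a randomly initialised one-layer ReLU network (assumed architecture)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def embed(record):
        # ReLU(W x): one non-negative activation per output unit.
        return [max(0.0, sum(w * v for w, v in zip(row, record)))
                for row in weights]

    return embed


def mean_embedding(embed, records):
    """Average the embeddings of a set of records, feature by feature."""
    embeddings = [embed(r) for r in records]
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def mmd_loss(real, synthetic, in_dim, out_dim=16, n_networks=5):
    """Linear-kernel MMD estimate: squared distance between mean embeddings,
    averaged over several randomly initialised networks."""
    total = 0.0
    for seed in range(n_networks):
        embed = random_embedder(in_dim, out_dim, seed)
        mu_real = mean_embedding(embed, real)
        mu_syn = mean_embedding(embed, synthetic)
        total += sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))
    return total / n_networks
```

Under these assumptions, a small synthetic set whose embeddings resemble those of the real data on average receives a low loss, even though none of its rows corresponds to an individual real record. That is the intuition behind treating condensed data as statistical metadata.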
