Democratising Clinical AI through Dataset Condensation for Classical Clinical Models

Anshul Thakur; Soheila Molaei; Pafue Christy Nganjimi; Joshua Fieggen; Andrew A. S. Soltan; Danielle Belgrave; Lei Clifton; David A. Clifton

TL;DR

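A differentially private, zero-order optimisation framework extends dataset condensation to non-differentiable clinical models such as decision trees and Cox regression. Empirical results show that the resulting condensed datasets preserve model utility while providing differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.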

Abstract

Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap using a differentially private, zero-order optimisation framework that extends DC to non-differentiable models using only function evaluations. Empirical results across six datasets, including both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
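To make "using only function evaluations" concrete, the sketch below shows a generic two-point random-direction gradient estimator, one common form of zero-order optimisation. It is an illustration under stated assumptions and not the paper's algorithm: the names `zero_order_gradient` and `toy_objective`, the step size, and the number of probe directions are all hypothetical, the toy objective merely stands in for "the loss of a non-differentiable model trained on the synthetic data", and the differential privacy mechanism is not shown.

```python
import numpy as np


def zero_order_gradient(objective, x, num_directions=20, smoothing=1e-2, rng=None):
    """Estimate the gradient of a black-box objective from function values only."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_directions):
        u = rng.standard_normal(x.shape)  # random probe direction
        # Symmetric finite difference along u: two evaluations, no autograd.
        slope = (objective(x + smoothing * u) - objective(x - smoothing * u)) / (2.0 * smoothing)
        grad += slope * u
    return grad / num_directions


def toy_objective(synthetic):
    # Hypothetical stand-in for the loss of a model trained on `synthetic`;
    # deliberately non-smooth (L1 distance to a fixed target).
    target = np.array([[1.0, -2.0], [0.5, 3.0]])
    return float(np.abs(synthetic - target).sum())


rng = np.random.default_rng(0)
synthetic = np.zeros((2, 2))  # the "condensed dataset" being optimised
initial_loss = toy_objective(synthetic)
for _ in range(200):
    # Plain gradient descent driven by the estimated gradient.
    synthetic -= 0.05 * zero_order_gradient(toy_objective, synthetic, rng=rng)
final_loss = toy_objective(synthetic)
```

Because the estimator only calls `objective`, the same loop could in principle wrap any model that can be trained and scored, at the cost of two objective evaluations per probe direction.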

Paper Structure (16 sections, 15 equations, 5 figures, 5 tables)

Introduction
Results
Datasets
Performance on Prediction Tasks
Performance on Survival Analysis Tasks
Generalisation to External Cohorts and Models
Interpretability Comparison of Real and Condensed Models
Discussion
Method
Model Training
Dataset Condensation via Zero-order Gradient Estimation
Extension to Survival Analysis
Differential Privacy Mechanism
White-box Membership Inference Attack
Attribute Inference Attack
...and 1 more section

Figures (5)

Figure 1: Overview of dataset condensation (DC) workflow and structure of synthetic data. (A) Schematic showing how DC integrates into clinical ML pipelines. (B) t-SNE projections (left) and nearest-neighbour distance distributions (right) for real and synthetic samples across PUH, OUH, and UK Biobank Proteomics datasets. Distributions compare distances from real to real and synthetic to real samples.
Figure 2: Kaplan–Meier (KM) curves for models trained on real and condensed datasets. The first row shows KM curves from (A) XGBoost and (B) Cox models trained on the Diabetes (UK Biobank) dataset and their respective best-performing condensed datasets. The second row shows corresponding KM curves from (C) XGBoost and (D) Cox models trained on the SEER dataset.
Figure 3: Cross-model evaluation of XGBoost-derived condensed data. Performance of support vector machine (SVM), Random Forest, and Logistic Regression models trained on condensed data generated using XGBoost, across four datasets: (A) PUH, (B) OUH, (C) UHB, and (D) Proteomics.
Figure 4: SHAP-based feature attribution comparison between models trained on real and condensed datasets. Top row: feature attributions from XGBoost models trained for COVID-19 prediction on (a) PUH and (b) OUH datasets. Bottom row: feature attributions from XGBoost-ARF models trained for survival prediction on (c) the UK Biobank diabetes dataset and (d) the SEER dataset. Each panel shows results using the best-performing condensed dataset.
Figure 5: Membership and attribute inference attacks on condensed data. (A) Performance of a distance-based white-box membership inference attack, evaluated using AUROC, membership advantage, and true positive rate at a false positive rate of 0.1. (B) Performance of the attribute inference attack, reported as $R^2$ scores for top attributes.
