Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates

Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur

TL;DR

This work tackles the limitations of Trajectory Matching in clinical dataset condensation, where noisy SGD paths and dense trajectory storage hinder practical deployment. It introduces mode connectivity–based trajectory surrogates, specifically quadratic Bézier curves $\Phi_{\boldsymbol{\phi}}(t)$ connecting $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_T$, to provide smooth, low-curvature supervision for condensation. The authors prove theoretical guarantees showing near-optimal average loss along the Bézier path and reduced curvature, while empirical results on five clinical datasets demonstrate that Bézier Trajectory Matching (BTM) achieves competitive or superior performance with substantial compression and cross-architecture generalisation. The approach significantly lowers storage requirements by storing only three points per trajectory and enables privacy-friendly data sharing, with potential integration of differential privacy for formal guarantees in future work.

Abstract

Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
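The surrogate described above is a quadratic Bézier curve between the initial and final model states. In its standard form, with a single control point $\boldsymbol{\phi}$, it reads $\Phi_{\boldsymbol{\phi}}(t) = (1-t)^2\,\boldsymbol{\theta}_0 + 2t(1-t)\,\boldsymbol{\phi} + t^2\,\boldsymbol{\theta}_T$ for $t \in [0, 1]$. The following is a minimal sketch of evaluating such a curve on flattened parameter vectors. The function names, the toy vectors, and the control point value are illustrative assumptions, and the sketch does not show how the control point is fitted.

```python
def bezier_point(theta_0, phi, theta_T, t):
    """Evaluate a quadratic Bezier curve in parameter space at t in [0, 1].

    theta_0, phi, theta_T are flattened parameter vectors (lists of floats):
    start point, control point, and end point respectively.
    """
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * p0 + b * pc + c * pT for p0, pc, pT in zip(theta_0, phi, theta_T)]


def bezier_targets(theta_0, phi, theta_T, num_points):
    """Sample evenly spaced surrogate checkpoints along the curve."""
    ts = [i / (num_points - 1) for i in range(num_points)]
    return [bezier_point(theta_0, phi, theta_T, t) for t in ts]


# Toy stand-ins (hypothetical values, not taken from the paper).
theta_0 = [0.0, 1.0, -2.0]   # random initialisation
theta_T = [4.0, -1.0, 2.0]   # trained SGD endpoint
phi = [1.0, 2.0, 0.0]        # control point; assumed to be given

targets = bezier_targets(theta_0, phi, theta_T, num_points=5)
```

Only `theta_0`, `phi`, and `theta_T` need to be kept. Any intermediate target can be regenerated on demand from these three vectors, which is what makes dense trajectory storage unnecessary.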

Paper Structure

This paper contains 27 sections, 1 theorem, 32 equations, 4 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathcal{L}:\Theta \to \mathbb{R}$ be a $\beta$-smooth, lower-bounded loss function, $\boldsymbol{\theta}_0 \in \Theta$ be a random initialisation with loss $\ell_0 = \mathcal{L}(\boldsymbol{\theta}_0)$, and $\boldsymbol{\theta}_T$ be an SGD endpoint after $K$ steps such that $\|\nabla \mathcal{L}$ [...] Assume the model map $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ is $L_f$-Lipschitz in $\boldsymbol{\theta}$ [...]

Figures (4)

  • Figure 1: Illustration of the key differences between traditional SGD trajectories and mode-connected paths in TM. (a) SGD trajectories are noisy and require many intermediate checkpoints, whereas mode-connected paths give smooth, direct connections using only the start and end models. (b) The average loss of the training set fluctuates and has high curvature along SGD trajectories, while mode-connected paths yield a stable, smoothly decreasing profile. (c) Mode-connected surrogates accelerate optimisation and reach lower trajectory-matching loss than raw SGD trajectories.
  • Figure 2: Performance comparison of different methods across ipc levels on eICU and MIMIC-III datasets.
  • Figure 3: Trajectory storage requirements across clinical datasets. SGD trajectories require 33$\times$ (eICU, CURIAL) and 20$\times$ (MIMIC-III) more storage than Bézier surrogates, which store only initial, final, and control checkpoints.
  • Figure 4: Impact of the number of inner-loop steps $N$ on AUROC performance at 200 ipc. BTM achieves strong performance with only 30 steps, reducing computational overhead. Similar trends are observed for AUPRC.
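The storage ratios quoted for Figure 3 can be related to checkpoint counts with simple arithmetic. The sketch below assumes every checkpoint has the same size, so that the ratio depends only on how many checkpoints are stored. The actual checkpoint counts and model sizes are not given here, so the figures computed are implied values under that assumption, and the function names are hypothetical.

```python
BEZIER_POINTS = 3  # initial, final, and control checkpoints


def storage_ratio(num_sgd_checkpoints, num_bezier_points=BEZIER_POINTS):
    """Ratio of SGD trajectory storage to Bezier surrogate storage,
    assuming all checkpoints are the same size."""
    return num_sgd_checkpoints / num_bezier_points


def implied_sgd_checkpoints(ratio, num_bezier_points=BEZIER_POINTS):
    """Number of SGD checkpoints implied by a reported storage ratio,
    under the same equal-size assumption."""
    return ratio * num_bezier_points


# Ratios reported in the Figure 3 caption.
reported = {"eICU": 33, "CURIAL": 33, "MIMIC-III": 20}
implied = {name: implied_sgd_checkpoints(r) for name, r in reported.items()}
```

Under this assumption, a 33$\times$ ratio corresponds to 99 stored SGD checkpoints per trajectory and a 20$\times$ ratio to 60, compared with three for the Bézier surrogate.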

Theorems & Definitions (3)

  • Theorem 1
  • proof
  • proof