Private and Collaborative Kaplan-Meier Estimators

Shadi Rahimian, Raouf Kerkouche, Ina Kurth, Mario Fritz

TL;DR

This work addresses privacy-preserving joint Kaplan-Meier (KM) estimation across multiple data-holding sites. It introduces two differentially private methods, DP-Surv and DP-Prob, which directly perturb survival-related representations, alongside DP-Matrix$^+$ as a baseline and a surrogate-dataset generator that converts flexibly between representations. The authors present a taxonomy of seven collaboration paths for building a global private KM estimator and show, in experiments on real medical datasets, that a joint estimator under a privacy budget such as $\varepsilon = 1$ can closely match the centralized non-private KM curve while protecting individual data. The surrogate-dataset approach and the multi-representation DP strategies support scalable, privacy-preserving collaboration in survival analysis, which makes them practical for cross-institutional studies. Future work includes extending the DP framework to censored data scenarios and refining the sensitivity analyses to improve utility under censoring.

Abstract

Kaplan-Meier estimators are essential tools in survival analysis, capturing the survival behavior of a cohort. Their accuracy improves with large, diverse datasets, encouraging data holders to collaborate for more precise estimations. However, these datasets often contain sensitive individual information, necessitating stringent data protection measures that preclude naive data sharing. In this work, we introduce two novel differentially private methods that offer flexibility in applying differential privacy to various functions of the data. Additionally, we propose a synthetic dataset generation technique that enables easy and rapid conversion between different data representations. Utilizing these methods, we propose various paths that allow a joint estimation of the Kaplan-Meier curves with strict privacy guarantees. Our contribution includes a taxonomy of methods for this task and an extensive experimental exploration and evaluation based on this structure. We demonstrate that our approach can construct a joint, global Kaplan-Meier estimator that adheres to strict privacy standards ($\varepsilon = 1$) while exhibiting no statistically significant deviation from the nonprivate centralized estimator.
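
For readers unfamiliar with the estimator the abstract refers to, the following is a minimal sketch of the standard product-limit (Kaplan-Meier) computation, $\hat{S}(t) = \prod_{t_j \le t} (1 - d_j / n_j)$, where $d_j$ is the number of events at time $t_j$ and $n_j$ is the number of individuals still at risk. The function name `kaplan_meier` and the toy cohort are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate for right-censored data.

    times  : observed time per individual (event or censoring time)
    events : 1 if the event was observed, 0 if the individual was censored
    Returns the distinct event times and the survival estimate at each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                   # n_j: still under observation
        deaths = np.sum((times == t) & (events == 1))  # d_j: events exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical cohort of 5 individuals; the third one is censored at t = 3.
t, s = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 1])
print(t)  # event times 2, 3, 5, 8
print(s)  # approximately 0.8, 0.6, 0.3, 0.0
```

The censored individual contributes to the at-risk count up to its censoring time but never triggers a drop in the curve, which is what distinguishes the estimator from a plain empirical survival function.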

Paper Structure

This paper contains 36 sections, 6 theorems, 68 equations, 8 figures, 13 tables, 3 algorithms.

Key Result

Theorem 1

Let $\mathcal{A}$ be an $\varepsilon$-DP privacy mechanism which assigns a value in $\mathit{Range}(\mathcal{A})$ to a dataset $D$. Let $\mathcal{B}$ be an arbitrary randomized mapping that takes as input $O \in \mathit{Range}(\mathcal{A})$ and returns $O' \in \mathit{Range}(\mathcal{B})$. Then $\mathcal{B} \circ \mathcal{A}$ is also $\varepsilon$-DP.
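
The post-processing property is what allows a noisy release to be cleaned up without spending additional privacy budget. The sketch below illustrates the idea on a survival curve: Laplace noise is added to the curve values, and the result is then clipped and made non-increasing using only the noisy output. This is an illustration under stated assumptions, not the paper's DP-Surv algorithm; in particular, the `sensitivity` value is a placeholder and not the bound derived by the authors, and the function names are hypothetical.

```python
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add i.i.d. Laplace noise with scale sensitivity / epsilon to each entry.

    `sensitivity` must be the L1 sensitivity of the whole released vector.
    """
    values = np.asarray(values, dtype=float)
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)

def postprocess_survival(noisy):
    """Touches only the noisy output, so the epsilon-DP guarantee is preserved."""
    clipped = np.clip(noisy, 0.0, 1.0)     # survival probabilities lie in [0, 1]
    return np.minimum.accumulate(clipped)  # enforce a non-increasing curve

rng = np.random.default_rng(0)
surv = np.array([0.8, 0.6, 0.3, 0.0])      # hypothetical non-private curve
# ASSUMPTION: 0.05 is a made-up sensitivity used only to make the example run.
noisy = laplace_mechanism(surv, sensitivity=0.05, epsilon=1.0, rng=rng)
private_curve = postprocess_survival(noisy)
print(private_curve)
```

Because `postprocess_survival` never looks at the original data, it plays the role of the mapping $\mathcal{B}$ in the theorem, with `laplace_mechanism` as $\mathcal{A}$.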

Figures (8)

  • Figure 1: A simple illustrative example of Kaplan-Meier and probability estimators for a dataset of 5 individuals.
  • Figure 2: Overall scheme of paths that are possible to construct a collaborative private KM estimator over the union of datasets.
  • Figure 3: Comparison of all the DP methods in a centralized setting, for one random run of the DP algorithms. The blue shaded region shows the confidence area of the non-private dataset.
  • Figure 4: Comparison of all the DP methods in a centralized setting, for $\varepsilon=1.0$ and one random run of the DP algorithms. The blue shaded region shows the confidence area of the non-private dataset.
  • Figure 5: Collaboration among 10 sites for 3 types of data splitting. Our private DP-Surv method is shown with the red line. The median and the $p$-value relative to the non-private, centralized estimator are shown as m and p, both for our method and for each site when only its local data is used to construct the KM curve.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 1: $\varepsilon$-Differential Privacy [Dwork2014book]
  • Definition 2: Global $L_p$-sensitivity
  • Definition 3: Laplace Mechanism [Dwork2014book]
  • Theorem 1: Post-Processing Property [Dwork2014book]
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 5
  • Lemma 6
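
The list above names Definitions 1 to 3 without stating them. For reference, their standard textbook forms are given below. The notation is an assumption: $D \sim D'$ denotes neighboring datasets, $f$ is a query with $k$ real-valued outputs, and the paper's own symbols may differ.

```latex
% Definition 1 (epsilon-differential privacy), standard form
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{A}(D') \in S]
\quad \text{for all } D \sim D' \text{ and all } S \subseteq \mathit{Range}(\mathcal{A})

% Definition 2 (global L_p-sensitivity of f), standard form
\Delta_p f \;=\; \max_{D \sim D'} \lVert f(D) - f(D') \rVert_p

% Definition 3 (Laplace mechanism), standard form; it satisfies epsilon-DP
\mathcal{A}(D) \;=\; f(D) + (Y_1, \dots, Y_k),
\qquad Y_i \overset{\text{i.i.d.}}{\sim} \mathrm{Lap}\!\left(\frac{\Delta_1 f}{\varepsilon}\right)
```

The Laplace mechanism is calibrated to the $L_1$-sensitivity, which is why the choice of representation to perturb (survival values, probabilities, or counts) changes how much noise is needed for the same $\varepsilon$.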