Privacy-Preserving EHR Data Transformation via Geometric Operators: A Human-AI Co-Design Technical Report

Maolin Wang, Beining Bao, Gan Yuan, Hongyu Chen, Bingkun Zhao, Baoshuo Kan, Jiming Xu, Qi Shi, Yinggong Zhao, Yao Wang, Wei Ying Ma, Jun Yan

Abstract

Electronic health records (EHRs) and other real-world clinical data are essential for clinical research, medical artificial intelligence, and life science, but their sharing is severely limited by privacy, governance, and interoperability constraints. These barriers create persistent data silos that hinder multi-center studies, large-scale model development, and broader biomedical discovery. Existing privacy-preserving approaches, including multi-party computation and related cryptographic techniques, provide strong protection but often introduce substantial computational overhead, reducing the efficiency of large-scale machine learning and foundation-model training. In addition, many such methods make data usable for restricted computation while leaving them effectively invisible to clinicians and researchers, limiting their value in workflows that still require direct inspection, exploratory analysis, and human interpretation. We propose a real-world-data transformation framework for privacy-preserving sharing of structured clinical records. Instead of converting data into opaque representations, our approach constructs transformed numeric views that preserve medical semantics and major statistical properties while, under a clearly specified threat model, provably breaking direct linkage between those views and protected patient-level attributes. Through collaboration between computer scientists and the AI agent SciencePal, acting as a constrained tool inventor under human guidance, we design three transformation operators that are non-reversible within this threat model, together with an additional mixing strategy for high-risk scenarios, supported by theoretical analysis and empirical evaluation under reconstruction, record linkage, membership inference, and attribute inference attacks.

Paper Structure (169 sections, 71 equations, 33 figures, 5 tables)

This paper contains 169 sections, 71 equations, 33 figures, 5 tables.

Figures (33)

  • Figure 1: SciencePal co-design protocol for geometric operator families. Step 1: Humans specify operator clauses C1--C4 and threat model C5. Step 2: SciencePal searches PPDP/PPDM and related methods, finding no operator that satisfies all clauses. Step 3: SciencePal proposes candidate operators T1, T2, and T3. Step 4: Humans prove properties and run attack-based evaluation, keeping T1/T2 and rejecting T3. Step 5: Observing residual high reconstruction risk for some variables, humans and SciencePal co-design per-stay Q-mix extensions, yielding the final operator family.
  • Figure 2: Overview of the column-wise geometric operators studied in this work. T1 applies local triplet rotations on $\mathcal{M}(0,1)$, preserving mean/variance and short-range autocorrelation with moderate privacy. T2 adds bounded z-score noise with global $\ell_\infty$ limit $\alpha$, then recenters to $\mathcal{M}(0,1)$, giving stronger privacy while largely preserving marginals and correlations. T3 is a global Householder reflection that preserves low-order statistics but is highly invertible ($R^{2} \approx 1$), and serves as a negative control. Q-mix wraps T1/T2 with per-stay orthogonal mixing on selected high-risk variables, keeping fixed while sharply reducing reconstruction risk and retaining acceptable utility.
  • Figure 3: Geometric sanity heatmaps. Left: log$_{10}$ of the maximum absolute mean deviation $\Delta\mu_{\max}$ by setting (phys vs z) and operator. Right: max_abs_delta_mean by setting and operator. All geometric operators keep mean deviations at or below machine precision.
  • Figure 4: Boxplots of $|\Delta\mu_v|$ and $|\Delta\sigma_v|$ grouped by evaluation setting. Deviations in standardized space are essentially at numerical precision; in physical space they remain on the order of $10^{-13}$–$10^{-14}$.
  • Figure 5: Average z-score $\ell_\infty$ perturbation $\|\delta_{s,v}\|_\infty$ by operator for two target bounds $\alpha = 0.5$ and $\alpha = 1.0$. For all T1/T2/T3 operators the empirical maximum $\|\delta_{s,v}\|_\infty$ never exceeds the configured $\alpha$; the average perturbation typically lies in the $0.7\alpha$–$0.9\alpha$ range.
  • ...and 28 more figures
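
The Figure 2 caption describes T3 as a global Householder reflection that preserves low-order statistics yet is highly invertible. The sketch below illustrates why a generic Householder reflection has both properties. It is not the paper's T3 implementation: the function name `householder_reflect`, the random direction `v`, and the step that centers `v` (so that the column mean is preserved) are all assumptions made for illustration.

```python
import numpy as np


def householder_reflect(z, v):
    """Apply H = I - 2 v v^T / (v^T v) to z without forming the matrix H.

    Assumption for this sketch: v is centered first, which makes it
    orthogonal to the all-ones vector and therefore leaves mean(z) unchanged.
    """
    v = v - v.mean()
    return z - 2.0 * v * (v @ z) / (v @ v)


rng = np.random.default_rng(0)
x = rng.normal(size=200)            # synthetic column, not real EHR data
z = (x - x.mean()) / x.std()        # standardized column: mean 0, variance 1
v = rng.normal(size=z.size)         # hypothetical reflection direction

z_t = householder_reflect(z, v)     # transformed view
z_back = householder_reflect(z_t, v)  # reflecting again undoes the transform

print("mean after transform:", z_t.mean())
print("std after transform: ", z_t.std())
print("max recovery error:  ", np.abs(z_back - z).max())
```

Because a reflection is orthogonal and is its own inverse, anyone who knows or can estimate `v` recovers the original column exactly, which is consistent with the caption's use of T3 as a negative control.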