DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications

Zebin Wang, Ziming Gan, Weijing Tang, Zongqi Xia, Tianrun Cai, Tianxi Cai, Junwei Lu

TL;DR

DANIEL reframes multi-institution, privacy-constrained learning of global embeddings for binary EHR data through a low-rank Ising model estimated with a non-convex bi-factored surrogate. The method achieves full-distribution guarantees with one-shot gradient communication, matching centralized rates while improving scalability and privacy. Theoretical results show minimax-like rates and robust initialization, and empirical evaluations on simulation and real-world EHRs demonstrate superior performance in relationship detection, phenotyping, clustering, and knowledge-graph construction. Collectively, DANIEL advances scalable, privacy-preserving statistical inference for high-dimensional biomedical data with broad applicability to federated healthcare analytics.

Abstract

Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
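As a rough illustration of what "bi-factored gradient descent" on a low-rank Ising parameter matrix can look like, the sketch below writes $\boldsymbol{\Theta} = \mathbf{U}\mathbf{V}^{\sf T}$ and takes gradient steps on the two factors. Several pieces are assumptions not taken from the paper: the $\pm 1$ coding with no external field, the negative log pseudo-likelihood used as the surrogate loss, the balancing penalty with weight `lam`, the toy data generator, and all function names. The last few lines check that, for equally sized sites, the average of site-level gradients equals the pooled gradient. This is only meant to show why gradients are a natural quantity to communicate; it is not the paper's communication protocol.

```python
import numpy as np

def pseudo_loss_and_grad(X, Theta):
    """Negative log pseudo-likelihood for a zero-field Ising model (+/-1 coding).

    Assumed surrogate, not necessarily the paper's. Returns the loss and its
    gradient with respect to the off-diagonal entries of Theta.
    """
    n, p = X.shape
    off_diag = 1.0 - np.eye(p)                 # the diagonal plays no role
    T = Theta * off_diag
    margins = X * (X @ T.T)                    # x_ij * sum_{k != j} theta_jk x_ik
    loss = np.mean(np.sum(np.logaddexp(0.0, -2.0 * margins), axis=1))
    weights = -2.0 * X / (1.0 + np.exp(2.0 * margins))
    grad = (weights.T @ X) / n * off_diag
    return loss, grad

def bifactored_step(U, V, grad, step, lam=0.5):
    """One gradient step on (U, V) for loss(U V^T) + (lam/4)||U^T U - V^T V||_F^2."""
    D = U.T @ U - V.T @ V                      # imbalance between the two factors
    U_new = U - step * (grad @ V + lam * U @ D)
    V_new = V - step * (grad.T @ U - lam * V @ D)
    return U_new, V_new

def fit(X, rank, steps=200, step=0.05, seed=0):
    """Plain bi-factored gradient descent from a small random start."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    U = 0.1 * rng.standard_normal((p, rank))
    V = U.copy()
    losses = []
    for _ in range(steps):
        loss, grad = pseudo_loss_and_grad(X, U @ V.T)
        losses.append(loss)
        U, V = bifactored_step(U, V, grad, step)
    losses.append(pseudo_loss_and_grad(X, U @ V.T)[0])
    return U, V, losses

# Toy correlated binary data (hypothetical): each feature copies a shared
# latent sign and is flipped independently with probability 0.2.
rng = np.random.default_rng(1)
n, p, m = 600, 8, 3
z = rng.choice([-1.0, 1.0], size=(n, 1))
flips = rng.random((n, p)) < 0.2
X = np.where(flips, -z, z)

U, V, losses = fit(X, rank=2)

# With m equally sized sites, averaging site-level gradients at a common
# Theta reproduces the pooled gradient.
Theta0 = 0.05 * rng.standard_normal((p, p))
pooled_grad = pseudo_loss_and_grad(X, Theta0)[1]
site_grads = [pseudo_loss_and_grad(Xs, Theta0)[1] for Xs in np.array_split(X, m)]
avg_grad = np.mean(site_grads, axis=0)
```

Working with the $p \times r$ factors instead of the full $p \times p$ matrix avoids a per-iteration projection onto low-rank matrices, which is one common motivation for factored, non-convex formulations.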

Paper Structure

This paper contains 39 sections, 12 theorems, 150 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Proposition 4

Consider an Ising graphical model proposed in Section sec:intro_model with the true parameter matrix $\boldsymbol{\Theta} ^ * = \{ \theta_{jk} ^ * \}$. If we have and there exists some $C > 0$ such that $\|\boldsymbol{\Theta} ^ *\|_{1, \infty} < C/p$, then $\|\mathbb{E}_{\boldsymbol{\Theta} ^ *} [\mathbf{W}_\ell (\boldsymbol{\Theta}) \mathbf{W}_\ell^{ \sf T}(\boldsymbol{\Theta})] \|_{\rm op}= O(p

Figures (5)

  • Figure 1: The trajectories of the $\|\cdot\|_{\text{F}}$-error and computation time across different feature dimensions $p$, with the distributedness level $x$ (where $m = \lfloor n^{x} \rfloor$) varying from 0 to 0.6. The total sample size is set at $n = 1,000$. The trajectories of DANIEL are shown in red, demonstrating superior performance and efficiency over baseline methods for distributedness levels $x < 0.5$ (i.e., $m = o(\sqrt{n})$). Vertically, lower $\|\cdot\|_{\text{F}}$-error and shorter computation time indicate better performance for any fixed $x$; horizontally, a flatter error trajectory as $x$ increases is preferred, as this indicates that the distributed estimators remain valid as compared with their centralized counterparts.
  • Figure 2: The trajectories of the $\|\cdot\|_{\text{F}}$-errors across different methods in the simulation setup with a total of $n=10,000$ samples partitioned into $m = 15$ institutions and the feature dimension $p$ increasing from 20 to 200. The slope of each trajectory indicates the sensitivity of estimation accuracy to increasing $p$. Flatter trajectories are preferred, as they indicate greater robustness to high-dimensional features.
  • Figure 3: Kaplan-Meier survival curves for two patient clusters obtained using $k$-means clustering on DANIEL-generated patient embeddings. The y-axis represents the estimated probability of nursing home admission, and the x-axis denotes time. A $p$-value is reported for the difference between the two clusters, with $p < 0.01$ indicating a statistically significant difference.
  • Figure 4: Knowledge graphs of (a) top features for AD patients and (b) top features for MS patients. Node size reflects the occurrence probability of each feature. Red nodes correspond to PheCode for diagnosis, and blue nodes correspond to RxNorm for medication usage. The presence and thickness of edges between nodes are determined by the values of DANIEL-estimated parameter matrix $\widehat{\boldsymbol{\Theta}}$.
  • Figure 5: The trajectories of the $\|\cdot\|_{\text{F}}$-error and computation time across different feature dimensions $p$, with the distributedness level $x$ (where $m = \lfloor n^{x} \rfloor$) varying from 0 to 0.6. The total sample size is set at $n = 1,000$. The trajectories of DANIEL are shown in red, demonstrating superior performance and efficiency over baseline methods for distributedness levels $x < 0.5$ (i.e., $m = o(\sqrt{n})$). Vertically, lower $\|\cdot\|_{\text{F}}$-error and shorter computation time indicate better performance for any fixed $x$; horizontally, a flatter error trajectory as $x$ increases is preferred, as this indicates that the distributed estimators remain valid as compared with their centralized counterparts.
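
To make the distributedness level in the Figure 1 and Figure 5 captions concrete, the short sketch below evaluates $m = \lfloor n^{x} \rfloor$ for $n = 1{,}000$ over a grid of $x$ values from 0 to 0.6. The grid spacing of 0.1, the helper name `num_sites`, and the even split of samples across sites are assumptions made for illustration; the captions state only the range of $x$ and the formula for $m$.

```python
import math

def num_sites(n, x):
    """Number of sites m = floor(n^x) for total sample size n and level x."""
    return math.floor(n ** x)

n = 1000
levels = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # assumed grid over [0, 0.6]

# Map each distributedness level to the implied number of sites.
sites = {x: num_sites(n, x) for x in levels}

# Approximate samples per site if the data were split evenly (an assumption).
per_site = {x: n // m for x, m in sites.items()}

for x in levels:
    print(f"x = {x:.1f}: m = {sites[x]:2d}, about {per_site[x]} samples per site")
```

On this grid, $x = 0.5$ gives $m = 31$, just under $\sqrt{1000} \approx 31.6$, so the values with $x < 0.5$ correspond to the $m = o(\sqrt{n})$ regime the captions highlight.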

Theorems & Definitions (16)

  • Remark 1: Invariance of Symmetry
  • Proposition 4
  • Theorem 5: Statistical Rate of DANIEL
  • Remark 6: Theoretical Contribution of Theorem \ref{thm:Divide_conquer}
  • Theorem 7: Rate Guarantee for DANIEL Initialization
  • Remark 8: Contribution of Theorem \ref{thm:main} on Valid Initialization
  • Theorem 9: DANIEL's Proximity to the Centralized Estimator
  • Remark 10: Contribution of Strategy in Proof to Theorem \ref{thm:rate_ctr_DANIEL}
  • Lemma 11: Contraction in One Step of Contraction
  • Lemma 12: Statistical Error
  • ...and 6 more