Unsupervised Discovery of Clinical Disease Signatures Using Probabilistic Independence

Thomas A. Lasko, John M. Still, Thomas Z. Li, Marco Barbero Mota, William W. Stead, Eric V. Strobl, Bennett A. Landman, Fabien Maldonado

TL;DR

This work tackles the problem of imprecise clinical diagnoses by learning high-resolution disease signatures from multi-modal EHR data using probabilistic independence. It converts episodic records into continuous longitudinal curves and factors the resulting matrix as $\mathbf{X} = \mathbf{A}\mathbf{S}$, with the rows of $\mathbf{S}$ representing mutually independent latent disease sources. The study demonstrates that signatures achieve better predictive power for lung malignancy than the original variables and can reveal undiagnosed cancer patterns, providing both predictive gain and interpretable insights into disease origins. These findings suggest that large-scale, unsupervised discovery of latent clinical sources can enhance diagnostic precision and offer actionable guidance for identifying hidden disease processes in clinical practice.
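
As a rough illustration of the first stage described above (turning episodic records into continuous curves and sampling them into a dense matrix), the sketch below uses piecewise-linear interpolation. The interpolation method, the function names `interpolate` and `cross_sections`, and the example variables are all assumptions made for illustration; this summary does not say how the paper builds its curves.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Piecewise-linear value at time t (assumed method, not the paper's).

    `times` must be strictly increasing. Outside the observed range the
    nearest observed value is held constant.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)  # times[i-1] <= t < times[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def cross_sections(record, sample_times):
    """Sample every variable's curve at the same times.

    record: {variable name: (times, values)} for one hypothetical patient.
    Returns the variable names and a dense variables-by-samples matrix.
    """
    names = sorted(record)
    matrix = [[interpolate(*record[name], t) for t in sample_times]
              for name in names]
    return names, matrix


if __name__ == "__main__":
    # Hypothetical irregular observations (time in days).
    record = {
        "creatinine": ([0, 10, 30], [1.0, 2.0, 1.0]),
        "hemoglobin": ([5, 25], [14.0, 12.0]),
    }
    names, X = cross_sections(record, sample_times=[0, 5, 20])
    for name, row in zip(names, X):
        print(name, row)
```

Each column of the resulting matrix is one cross section: every variable evaluated at the same instant, which is what makes the asynchronous record usable as a regular input matrix.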

Abstract

Insufficiently precise diagnosis of clinical disease is likely responsible for many treatment failures, even for common conditions and treatments. With a large enough dataset, it may be possible to use unsupervised machine learning to define clinical disease patterns more precisely. We present an approach to learning these patterns by using probabilistic independence to disentangle the imprint on the medical record of causal latent sources of disease. We inferred a broad set of 2000 clinical signatures of latent sources from 9195 variables in 269,099 Electronic Health Records. The learned signatures produced better discrimination than the original variables in a lung cancer prediction task unknown to the inference algorithm, predicting 3-year malignancy in patients with no history of cancer before a solitary lung nodule was discovered. More importantly, the signatures' greater explanatory power identified pre-nodule signatures of apparently undiagnosed cancer in many of those patients.
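
To make the idea of disentangling by independence concrete, here is a toy two-variable sketch. It assumes an independent component analysis style of inference (whitening followed by a search for the rotation that maximizes non-Gaussianity, measured by excess kurtosis); this summary does not name the algorithm the authors used, so treat the method, the mixing matrix `A`, and every function name as illustrative assumptions rather than the paper's procedure.

```python
import math
import random


def laplace(rng):
    # Difference of two exponentials is Laplace-distributed: heavy-tailed, non-Gaussian.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def rotate(u, v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return ([c * a + s * b for a, b in zip(u, v)],
            [-s * a + c * b for a, b in zip(u, v)])


def whiten(x0, x1):
    """Center, rotate onto principal axes, and scale to unit variance."""
    n = len(x0)
    m0, m1 = sum(x0) / n, sum(x1) / n
    x0 = [a - m0 for a in x0]
    x1 = [b - m1 for b in x1]
    c00 = sum(a * a for a in x0) / n
    c11 = sum(b * b for b in x1) / n
    c01 = sum(a * b for a, b in zip(x0, x1)) / n
    # Angle of the principal axes of the 2x2 covariance matrix.
    p0, p1 = rotate(x0, x1, 0.5 * math.atan2(2.0 * c01, c00 - c11))
    sd0 = math.sqrt(sum(a * a for a in p0) / n)
    sd1 = math.sqrt(sum(b * b for b in p1) / n)
    return [a / sd0 for a in p0], [b / sd1 for b in p1]


def excess_kurtosis(v):
    # Valid for zero-mean, unit-variance input; zero for a Gaussian.
    return sum(a ** 4 for a in v) / len(v) - 3.0


def unmix(x0, x1, steps=90):
    """Grid-search the rotation of the whitened data with maximal non-Gaussianity."""
    z0, z1 = whiten(x0, x1)
    best_score, best = -1.0, None
    for k in range(steps):
        y0, y1 = rotate(z0, z1, 0.5 * math.pi * k / steps)
        score = abs(excess_kurtosis(y0)) + abs(excess_kurtosis(y1))
        if score > best_score:
            best_score, best = score, (y0, y1)
    return best


def run_demo(seed=0, n=3000):
    rng = random.Random(seed)
    # Two independent latent sources (rows of S).
    s0 = [laplace(rng) for _ in range(n)]
    s1 = [laplace(rng) for _ in range(n)]
    # Hypothetical mixing matrix; each column plays the role of a signature.
    A = [[1.0, 0.5], [0.3, 1.0]]
    x0 = [A[0][0] * a + A[0][1] * b for a, b in zip(s0, s1)]
    x1 = [A[1][0] * a + A[1][1] * b for a, b in zip(s0, s1)]
    y0, y1 = unmix(x0, x1)
    return (s0, s1), (x0, x1), (y0, y1)


if __name__ == "__main__":
    (s0, s1), (x0, x1), (y0, y1) = run_demo()
    print("corr between observed variables:", round(corr(x0, x1), 3))
    print("corr between recovered sources: ", round(corr(y0, y1), 3))
```

The observed variables are correlated because they share sources, while the recovered components match the true sources only up to order, sign, and scale. That ambiguity is inherent to this family of methods and is why a scaling convention for expression levels is needed.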

Paper Structure (33 sections, 8 figures, 1 table)


Figures (8)

  • Figure 1: The pipeline for learning clinical signatures and their patient-level expressions from noisy, asynchronous, and irregular EHR data. The first three steps transform the data into a dense, regular matrix $\mathbf{X}$ for machine learning. The final step infers the clinical signatures $\mathbf{A}$ and the expression levels $\mathbf{S}$ of latent disease sources using probabilistic independence.
  • Figure 2: Example causal graph showing observed variables $X_i$, label $Y$, and unobserved (latent) sources $S_i$.
  • Figure 3: An example clinical signature, selected to illustrate the recovery of low-prevalence sources (here, a few hundred out of 630,000 sampled cross sections), and not cherry-picked for clinical coherence. We interpret this signature as representing the rare condition Spasmodic Torticollis [Termsarasab2016], a specific type of dystonia. Bar length gives the size of the change in a given standardized variable, with numbers in parentheses indicating the change in original data units for a 1.0 change in expression. Changes may be either multiplicative ($\times \cdots$) or additive ($+ \cdots$). Inset is a log-scaled histogram of expression levels in the Discovery Set sample matrix. Expression units are individually scaled for each signature such that the standard deviation is 0.5, placing 95% of all expressions within the interval $[-1,1]$.
  • Figure 4: A second example signature of a rarely expressed source, which we interpret as idiopathic pulmonary fibrosis treated with pirfenidone. Several other signatures relate to this same condition without the treatment.
  • Figure 5: Histograms of area under the ROC curve (AUC) on the held out test set for all six models, each retrained 100 times with different random seeds. Models using the learned signatures $\mathbf{S}$ (solid lines) were more predictive than those using the observed data matrix $\mathbf{X}$ (dotted lines). Model variability solely due to the choice of random seed was substantial, especially for the XGBoost models. Median performance is a more robust indicator of performance; extremes in the tails are unlikely to generalize to unseen data [DAmour2022].
  • ...and 3 more figures
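
The Figure 3 caption notes that expression units are scaled per signature so that the standard deviation is 0.5. A minimal sketch of one way such a convention can be applied without changing the factorization: multiply each row of $\mathbf{S}$ by a constant and divide the matching column of $\mathbf{A}$ by the same constant, which leaves the product $\mathbf{A}\mathbf{S}$ unchanged. The helper names and the small matrices below are hypothetical, and the paper's actual procedure may differ.

```python
import math


def std(v):
    """Population standard deviation."""
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))


def matmul(A, S):
    """Plain-Python product of A (variables x sources) and S (sources x samples)."""
    return [[sum(A[i][k] * S[k][j] for k in range(len(S)))
             for j in range(len(S[0]))]
            for i in range(len(A))]


def rescale(A, S, target_sd=0.5):
    """Scale each source row to `target_sd`, compensating in the columns of A."""
    A2 = [row[:] for row in A]
    S2 = []
    for k, source in enumerate(S):
        c = target_sd / std(source)
        S2.append([c * v for v in source])
        for row in A2:
            row[k] /= c  # keeps A2 @ S2 equal to A @ S
    return A2, S2


if __name__ == "__main__":
    # Hypothetical 2-variable, 2-source, 4-sample example.
    A = [[2.0, 0.0], [1.0, 3.0]]
    S = [[0.0, 4.0, -4.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
    A2, S2 = rescale(A, S)
    print([round(std(row), 6) for row in S2])
    print(matmul(A2, S2))
```

With this convention a 1.0 change in expression corresponds to two standard deviations of that source, so the column of $\mathbf{A}$ reads directly as the change in each variable for a large but typical shift in expression.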