Imputation of Unknown Missingness in Sparse Electronic Health Records

Jun Han, Josue Nassar, Sanjit Singh Batra, Aldo Cordova-Palomera, Vijay Nori, Robert E. Tillman

TL;DR

A transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where the authors predict data are missing in binary EHRs, demonstrating improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches.

Abstract

Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general-purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches and lead to increased performance on downstream tasks using the denoised data. In particular, when applied to a real-world application, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.

Paper Structure (25 sections, 2 theorems, 27 equations, 6 figures, 8 tables)

Key Result

Theorem 1

Let $p(\mathbf{x})$ be the data distribution, let $q(\mathbf{\tilde{x}} \mid \mathbf{x})$ be the noise distribution as defined in eq:new_noise_corruption, and let $q(\mathbf{\tilde{x}}) = \int q(\mathbf{\tilde{x}} \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}$ be the marginal distribution of the noisy data. ...

Figures (6)

  • Figure 1: A) An example of imputing known unknowns. In the patient EHR, ICD 3 is $\textrm{NaN}$, so there are only two possible choices for ICD 3: 0 or 1. B) An example of imputing unknown unknowns. In the patient EHR, ICD 2 and ICD 3 are both 0. Due to reporting issues, it is unclear whether a 0 represents a negative diagnosis (true value 0) or a missing diagnosis (true value 1). Because of this ambiguity, there are 4 possibilities after imputing: i) ICD 2 and ICD 3 are both negative diagnoses, so both are 0. ii) ICD 2 was missing, so it is 1; ICD 3 was a negative diagnosis, so it is 0. iii) ICD 2 was a negative diagnosis, so it is 0; ICD 3 was missing, so it is 1. iv) Both ICD 2 and ICD 3 were missing, so both are 1. C) Our proposed approach, Denoise2Impute-T, comprises 2 components. A patient EHR is input into a Set Transformer based denoiser that outputs a denoised patient EHR. The EHR is also input into a thresholding network that outputs a threshold for each ICD code. An element-wise greater-than-or-equal-to comparison between the denoised EHR and the thresholds yields the final imputed EHR.
  • Figure 2: Descriptive statistics for both datasets used in this paper. A) Comparison of the prevalence of ICD codes in $\mathcal{D}_1$ vs $\mathcal{D}_2$. B) Comparison of the sparsity of patient EHRs in $\mathcal{D}_1$ vs $\mathcal{D}_2$. C) The eigenvalue spectra of the covariance matrix of $\mathcal{D}_1$ (blue) with respect to 100 prevalence-matched random binary matrices (red). D) The eigenvalue spectra of the covariance matrix of $\mathcal{D}_2$ (blue) with respect to 100 prevalence-matched random binary matrices (red).
  • Figure 3: Dimension-wise AUPRC for $T=993$ computed on the test data. A) Comparison between the AUPRC of Denoise2Impute and the column mean (prevalence) for each dimension of the noisy data $\mathcal{D}_1$. B) Comparison between the AUPRC of Denoise2Impute and that of $\mathcal{D}_1$. C) Comparison between the AUPRC of Denoise2Impute and that of the MLP.
  • Figure 4: Results for ICD code prediction for six common chronic conditions (A) and eight randomly chosen ICD codes (B). Results for the hospital readmission task (C). Differences in AUPRC relative to using $\mathcal{D}_1$ are plotted with 95% confidence intervals. The common chronic conditions in (A) and the descriptions of the ICD codes in (B) are provided in Table \ref{app:tab:eight_disease_captions} in Appendix \ref{app:icd_details}.
  • Figure 5: Neural Network Architecture of the denoising model $g_{\bm{\theta}}$, where one of the key modules is the set attention block (SAB). The tensor shape of outputs in each layer is provided. We apply $L$ layers of the SAB.
  • ...and 1 more figure
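
The final step described in the Figure 1C caption, an element-wise comparison between the denoised EHR and per-code thresholds, can be illustrated with a minimal NumPy sketch. It also enumerates the candidate completions from Figure 1B. The function names and all numeric values below are made up for illustration; in the paper the denoised scores and thresholds come from the Set Transformer denoiser and the thresholding network, neither of which is modeled here.

```python
from itertools import product

import numpy as np


def threshold_impute(denoised, thresholds):
    """Element-wise `denoised >= thresholds`, returned as a 0/1 vector.

    Hypothetical helper: it mirrors only the comparison step in the
    Figure 1C caption, not the networks that produce its inputs.
    """
    denoised = np.asarray(denoised, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return (denoised >= thresholds).astype(int)


def candidate_completions(observed):
    """All binary vectors that agree with `observed` wherever it is 1.

    Every observed 0 may be a true negative (0) or a missing positive (1),
    which is the ambiguity illustrated in Figure 1B.
    """
    observed = np.asarray(observed, dtype=int)
    zero_idx = np.flatnonzero(observed == 0)
    candidates = []
    for fill in product([0, 1], repeat=len(zero_idx)):
        completed = observed.copy()
        completed[zero_idx] = fill
        candidates.append(completed.tolist())
    return candidates


# Made-up example: ICD 1 is observed as 1; ICD 2 and ICD 3 are observed as 0.
observed = [1, 0, 0]
denoised = [0.9, 0.6, 0.1]    # placeholder denoiser outputs
thresholds = [0.5, 0.4, 0.3]  # placeholder per-code thresholds

imputed = threshold_impute(denoised, thresholds)
candidates = candidate_completions(observed)
print(imputed.tolist())
print(len(candidates))
```

With these placeholder values, the two observed zeros give four candidate completions, and the thresholded output selects the one in which ICD 2 is treated as a missing positive and ICD 3 as a true negative.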

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • proof