Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding

Xiao Xiang, David Restrepo, Hyewon Jeong, Yugang Jia, Leo Anthony Celi

TL;DR

This work addresses learning from irregular, sparsely observed EHR time series by proposing AID-MAE, a dual-masked autoencoder that uses an intrinsic missingness mask and an augmented mask to train on incomplete tables without explicit imputation. The model applies a Transformer-based encoder–decoder to fixed-length grids with value-time embeddings and a dual reconstruction loss, and reports state-of-the-art performance on mortality, length-of-stay (LOS), and acute kidney injury (AKI) prediction across MIMIC-IV and PhysioNet 2012. Pretrained embeddings transfer strongly in low-label regimes and reveal clinically coherent feature organization and patient subtyping, suggesting robust, generalizable representations. The results support dual masking as a scalable approach to tabular EHR representation learning, with potential extensions to explicit missing-not-at-random modeling and multimodal data.

Abstract

Learning from electronic health record (EHR) time series is challenging due to irregular sampling, heterogeneous missingness, and the resulting sparsity of observations. Prior self-supervised methods either impute before learning, represent missingness through a dedicated input signal, or optimize solely for imputation, which limits their capacity to efficiently learn representations that support clinical downstream tasks. We propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), which learns directly from incomplete time series by applying an intrinsic missing mask to represent naturally missing values and an augmented mask that hides a subset of observed values for reconstruction during training. AID-MAE processes only the unmasked subset of tokens and consistently outperforms strong baselines, including XGBoost and DuETT, across multiple clinical tasks on two datasets. In addition, the learned embeddings naturally stratify patient cohorts in the representation space.
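
The two masks described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_dual_masks`, the `mask_ratio` parameter, the use of `None`/NaN to mark naturally missing entries, and uniform random sampling of the hidden entries are all assumptions made for illustration.

```python
import math
import random


def build_dual_masks(grid, mask_ratio=0.5, seed=0):
    """Build intrinsic, augmented, and visible masks for a time-by-feature grid.

    grid: list of rows (time steps); None or NaN marks a naturally missing value.
    mask_ratio: fraction of OBSERVED entries to hide (hypothetical default).
    Returns three boolean grids with the same shape as `grid`.
    """
    rng = random.Random(seed)

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Intrinsic mask: True where the value was never recorded.
    intrinsic = [[is_missing(v) for v in row] for row in grid]

    # Augmented mask: hide a random subset of the observed entries only.
    observed = [
        (t, f)
        for t, row in enumerate(intrinsic)
        for f, missing in enumerate(row)
        if not missing
    ]
    hidden = set(rng.sample(observed, int(len(observed) * mask_ratio)))
    augmented = [
        [(t, f) in hidden for f in range(len(row))] for t, row in enumerate(grid)
    ]

    # Visible entries are the only ones an encoder would receive.
    visible = [
        [not i and not a for i, a in zip(i_row, a_row)]
        for i_row, a_row in zip(intrinsic, augmented)
    ]
    return intrinsic, augmented, visible


# Toy 3x3 grid: 6 observed entries, 3 naturally missing.
grid = [[1.0, None, 3.0], [None, 5.0, 6.0], [7.0, 8.0, float("nan")]]
intrinsic, augmented, visible = build_dual_masks(grid, mask_ratio=0.5, seed=1)
print(sum(v for row in visible for v in row), "visible entries")
```

In this sketch the augmented mask is drawn only from observed entries, so the two masks never overlap and no imputed value is ever needed as a reconstruction target.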

Paper Structure

This paper contains 42 sections, 8 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: AID-MAE (Augmented-Intrinsic Dual-Masked AutoEncoder). The original data contain inherently missing entries (MISSING); a random subset of the observed tokens is additionally masked ([MASK]). Each measurement (value and timestamp) is embedded with positional encodings. The encoder processes only the unmasked tokens, and the decoder receives the encoded representations together with a learned padding token [PAD] in place of missing or masked entries. Training optimizes a dual loss that reconstructs the unmasked values and predicts the features hidden by augmented masking; intrinsically missing entries are excluded from the loss.
  • Figure 2: Linear probing results for mortality prediction and length of stay prediction tasks. We compare our model against Logistic Regression with raw data (median imputed) and DuETT with frozen encoder, across different training data percentages (1%, 5%, 10%, 50%, 100%). Error bars represent standard deviation across 5 seeds.
  • Figure 3: UMAP of feature embeddings. UMAP projection of $N=100{,}000$ randomly sampled 64-D embeddings for 50 lab features. Each point corresponds to one measurement embedding, and colors denote 13 highlighted lab types. Two patterns stand out. Bottom-left: neighboring pink (Platelet Count) and green (WBC) islands. Bottom-right: neighboring purple (Hemoglobin) and green (Hematocrit) islands. This geometric proximity is consistent with the clinical coupling of each pair.
  • Figure 4: UMAP visualization of first-day CLS embeddings for initial MICU and CVICU admissions. Colors represent clusters from K-means ($k=2$) applied in the embedding space.
  • Figure 5: Linear probing results for mortality prediction and length of stay prediction tasks. We compare our model against Logistic Regression with median imputation and DuETT across different training data percentages (1%, 5%, 10%, 50%, 100%). Results are shown for both AUROC (top row) and AUPRC (bottom row) metrics. Our model consistently outperforms baseline methods across all data regimes and tasks, with particularly strong performance in low-data scenarios. Error bars represent standard deviation across 5 random seeds. The x-axis uses logarithmic scaling to better visualize performance across the range of training data percentages.
  • ...and 1 more figure
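
The dual loss described in the Figure 1 caption can be sketched as follows. The exact form of the loss is not given in this summary, so the sketch assumes a mean squared error on each term and a hypothetical `weight_masked` coefficient for combining them; the function name `dual_reconstruction_loss` is likewise illustrative.

```python
def dual_reconstruction_loss(target, pred, intrinsic, augmented, weight_masked=1.0):
    """Sum of two mean squared errors over a time-by-feature grid.

    Term 1 covers entries that stayed visible (reconstruction).
    Term 2 covers entries hidden by the augmented mask (prediction).
    Intrinsically missing entries have no ground truth and are skipped.
    `weight_masked` is an assumed weighting, not taken from the paper.
    """
    vis_err, vis_n, msk_err, msk_n = 0.0, 0, 0.0, 0
    for t_row, p_row, i_row, a_row in zip(target, pred, intrinsic, augmented):
        for y, y_hat, missing, hidden in zip(t_row, p_row, i_row, a_row):
            if missing:
                continue  # excluded from the loss entirely
            sq = (y - y_hat) ** 2
            if hidden:
                msk_err += sq
                msk_n += 1
            else:
                vis_err += sq
                vis_n += 1
    loss_visible = vis_err / vis_n if vis_n else 0.0
    loss_masked = msk_err / msk_n if msk_n else 0.0
    return loss_visible + weight_masked * loss_masked


# Toy 2x2 example: one missing entry, one entry hidden by augmentation.
target = [[1.0, None], [2.0, 3.0]]
pred = [[1.5, 9.0], [2.0, 1.0]]
intrinsic = [[False, True], [False, False]]
augmented = [[False, False], [False, True]]
loss = dual_reconstruction_loss(target, pred, intrinsic, augmented)
print(loss)  # visible term 0.125 + masked term 4.0 = 4.125
```

The prediction of 9.0 at the intrinsically missing position has no effect on the result, which is the behavior the caption describes for naturally missing entries.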