Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data

David R. Bellamy, Bhawesh Kumar, Cindy Wang, Andrew Beam

TL;DR

Labrador introduces a continuous Transformer architecture pretrained on a large corpus of lab measurements to learn representations from numeric EHR data. Despite strong intrinsic pre-training performance and effective lab value imputation, transfer learning to downstream clinical tasks yields limited gains, with XGBoost often outperforming the transformers. The study finds Labrador generally outperforms a BERT baseline but still struggles to surpass traditional tree-based methods, highlighting data-scale and data-generating-process limitations. The authors advocate multimodal, multimethod modeling and larger, harmonized datasets to realize the potential of foundation models for numerical EHR data.

Abstract

In this work we introduce Labrador, a pre-trained Transformer model for laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million lab test results from electronic health records (EHRs) and evaluated on various downstream outcome prediction tasks. Both models demonstrate mastery of the pre-training task, but neither consistently outperforms XGBoost on downstream supervised tasks. Our ablation studies reveal that transfer learning shows limited effectiveness for BERT and achieves marginal success with Labrador. We explore the reasons for the failure of transfer learning and suggest that the data generating process underlying each patient cannot be characterized sufficiently using labs alone, among other factors. We encourage future work to focus on joint modeling of multiple EHR data categories and to include tree-based baselines in their evaluations.

Paper Structure (37 sections, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Labrador model architecture.
  • Figure 2: UMAP Visualization of Labrador and BERT Embeddings. A. A global view of the embedding space structure for Labrador (left) and BERT (right). The 70 most frequently ordered lab tests are shown and colored according to the panel of tests they are typically ordered with (see the appendix on lab panels for all panel definitions). All labs in this figure are from the test split and were not seen during pre-training. B. Visualization of embeddings for four routinely collected laboratory measurements colored by lab value and scaled to the interval [0, 1]. Compared to BERT, Labrador appears to encode the measured lab value more naturally, with a smooth gradient across lab values.
  • Figure 3: Intrinsic Evaluation of Labrador and BERT by lab value imputation. A. We masked a random lab from each bag in the test set of the pre-training data and imputed these values using pre-trained Labrador (left) and BERT (right). Both Labrador and BERT achieve a Pearson correlation $r^2 > 0.8$, in contrast to their ablations (orange). B. Imputations for the four best lab tests as measured by Pearson correlation. C. Imputations for the four worst lab tests as measured by Pearson correlation.
  • Figure 4: Pre-training loss for BERT with 68.5M ($d_k = \left\lfloor \frac{d_{\text{model}}}{h}\right\rfloor$) versus 194M parameters ($d_k = d_{\text{model}}$). Top: The training loss for both BERT models. Bottom: The validation loss.
  • Figure 5: Lab value imputations from BERT with 68.5M ($d_k = \left\lfloor \frac{d_{\text{model}}}{h}\right\rfloor$) versus 194M parameters ($d_k = d_{\text{model}}$). A. Imputations from both pre-trained BERT models (blue) as well as their ablations (orange) on the test split of the pre-training data. B. Imputations for the four best lab tests as measured by Pearson correlation. C. Imputations for the four worst lab tests as measured by Pearson correlation.
  • ...and 3 more figures
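
The intrinsic evaluation described in the Figure 3 caption (mask one random lab per bag, impute it, and compare imputed against true values by Pearson correlation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the bag layout as a list of `(lab_code, value)` pairs, the function names, and the `mean_of_context` stand-in imputer are all hypothetical, and a real run would call the pre-trained model in place of the stand-in.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    # Correlation is undefined when either input is constant.
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")


def mask_one_lab_per_bag(bags, rng):
    """Hide one randomly chosen (lab_code, value) pair in each bag.

    Returns (context, masked_code, true_value) triples, where context is
    the bag with the chosen pair removed.
    """
    examples = []
    for bag in bags:
        idx = int(rng.integers(len(bag)))
        code, value = bag[idx]
        examples.append((bag[:idx] + bag[idx + 1:], code, value))
    return examples


def evaluate_imputation(bags, impute_fn, seed=0, min_count=3):
    """Return (overall r, {lab_code: r}) for true versus imputed values."""
    rng = np.random.default_rng(seed)
    truth, preds, codes = [], [], []
    for context, code, value in mask_one_lab_per_bag(bags, rng):
        truth.append(value)
        preds.append(impute_fn(context, code))
        codes.append(code)
    truth, preds, codes = np.array(truth), np.array(preds), np.array(codes)

    per_lab = {}
    for code in np.unique(codes):
        sel = codes == code
        # Skip labs masked too rarely for a meaningful correlation.
        if sel.sum() >= min_count:
            per_lab[str(code)] = pearson_r(truth[sel], preds[sel])
    return pearson_r(truth, preds), per_lab


def mean_of_context(context, code):
    """Hypothetical stand-in for a trained model: average the visible values."""
    return float(np.mean([value for _, value in context]))


if __name__ == "__main__":
    # Synthetic bags: three labs per bag sharing a latent level plus noise.
    data_rng = np.random.default_rng(1)
    bags = [
        [(code, float(z + data_rng.normal(0, 0.05))) for code in ("GLU", "NA", "K")]
        for z in data_rng.uniform(0, 1, size=200)
    ]
    overall, per_lab = evaluate_imputation(bags, mean_of_context)
    print(f"overall r = {overall:.3f}")
    for code, r in sorted(per_lab.items(), key=lambda kv: kv[1]):
        print(f"{code}: r = {r:.3f}")
```

Sorting the per-lab correlations, as the usage section does, is one way to pick out the best and worst imputed labs shown in panels B and C. The sketch returns r itself; square it if the coefficient of determination is wanted.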