When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications?

Yanjun Gao; Skatje Myers; Shan Chen; Dmitriy Dligach; Timothy A Miller; Danielle Bitterman; Matthew Churpek; Majid Afshar

TL;DR

This study probes whether zero-shot embeddings taken from the last hidden states of open-source LLMs can effectively represent numerical EHR data for medical predictions, comparing them to traditional raw-data features fed to ML models such as XGBoost. It systematically investigates table-to-text conversion formats, embedding extraction methods, prompt design, few-shot data, and parameter-efficient tuning (QLoRA) across two clinically important tasks derived from EHRs: diagnosis prediction and mortality and length-of-stay (LOS) forecasting, using datasets including a 660-patient diagnosis cohort and MIMIC-Extract. The findings indicate that raw EHR features generally outperform LLM embeddings, though zero-shot embeddings can achieve competitive performance on several tasks and may offer deployment advantages; embedding-based approaches consistently underperform direct LLM generation for binary clinical predictions. The work highlights the need for improved time-varying feature representations, better prompt strategies, and potential benefits from multi-modal integration, while underscoring the substantial computational demands and ethical considerations associated with LLM-based medical decision support.

Abstract

The introduction of Large Language Models (LLMs) has advanced data representation and analysis, bringing significant progress in their use for medical question answering. Despite these advancements, integrating tabular data, especially numerical data pivotal in clinical contexts, into LLM paradigms has not been thoroughly explored. In this study, we examine the effectiveness of vector representations from the last hidden states of LLMs for medical diagnostics and prognostics using electronic health record (EHR) data. We compare the performance of these embeddings with that of raw numerical EHR data when used as feature inputs to traditional machine learning (ML) algorithms that excel at tabular data learning, such as eXtreme Gradient Boosting. We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utility as feature extractors to enhance ML classifiers for predicting diagnoses, length of stay, and mortality. Furthermore, we examine prompt engineering techniques on zero-shot and few-shot LLM embeddings to measure their impact comprehensively. Although the findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results, suggesting a promising avenue for future research in medical applications.
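The pipeline described in the abstract (numeric features rendered as text, last-hidden-state vectors pooled into one embedding, and that embedding used as classifier input) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the template wording in `to_narrative`, the feature names and values, and the hand-written `fake_hidden` matrix that stands in for real LLM hidden states. It does not reproduce the paper's templates, models, or results.

```python
from typing import Dict, List


def to_narrative(features: Dict[str, float]) -> str:
    """Render numeric features as one sentence (hypothetical template)."""
    parts = [f"{name} is {value:g}" for name, value in features.items()]
    return "The patient's " + ", ".join(parts) + "."


def max_pool(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise max over the token axis: (tokens, dim) -> (dim,)."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    return [max(column) for column in zip(*hidden_states)]


def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise mean over the token axis: (tokens, dim) -> (dim,)."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    n_tokens = len(hidden_states)
    return [sum(column) / n_tokens for column in zip(*hidden_states)]


# Illustrative feature values, not taken from any dataset.
query = to_narrative({"heart rate": 112, "lactate": 3.4})

# Stand-in for an LLM's last hidden states: 3 tokens, 4 dimensions.
# A real run would obtain this matrix by encoding `query` with a model.
fake_hidden = [
    [0.1, -0.5, 0.3, 0.0],
    [0.4, 0.2, -0.1, 0.9],
    [-0.2, 0.6, 0.5, 0.1],
]

embedding = max_pool(fake_hidden)  # one fixed-size vector per patient query
print(query)
print(embedding)
```

In a full setup, the pooled vector for each patient would form one row of the feature matrix passed to a tabular classifier such as XGBoost, in place of (or for comparison against) the raw numeric columns.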

Paper Structure (28 sections, 6 figures, 15 tables)

Introduction
Related Work
Datasets and Tasks
Diagnosis prediction for clinical deterioration
Mortality and length-of-stay prediction
Methods and Experiment Setup
Table-to-text conversion
Embedding extraction methods
Selection of LLMs
Prompt design and few-shot learning
Parameter efficient fine-tuning
Experiment setup
Results
Main results for diagnosis prediction
Main results for mortality prediction and length-of-stay
...and 13 more sections

Figures (6)

Figure 1: Physician Evaluation of LLMs' Knowledge on Normal Vital Sign and Lab Test Values. This experiment probes Mistral-7B-Instruct and Llama2-13B-Chat on reference ranges for 24 vital signs and lab tests. Results show these models have a strong understanding of normal medical values, which is crucial for clinical applications. Table \ref{tab:feat_and_templates} lists all 24 feature names, and more output examples are in Appendix \ref{sec:prob_examples}.
Figure 2: This study investigates the feasibility of using LLM embeddings for numerical EHR data features representation in medical machine learning applications. To use LLMs, raw features are transformed into queries via templates. Under a zero-shot setting, these queries are encoded into embeddings for ML classification. We explore the effects of prompt engineering, few-shot learning using synthetic data generation, and parameter efficient tuning on LLM embeddings.
Figure 3: Accuracy (left) and AUROC (right) for in-ICU mortality (mort ICU), in-hospital mortality (mort Hosp), and hospital LOS exceeding 3 days (LOS 3) and 7 days (LOS 7). The Logistic Regression (LR) and Random Forest (RF) baselines are reported from Wang et al. (2020). The LLM results are from the LLM embeddings + XGB settings. The CIs mostly overlap; for clarity of presentation, they are omitted from this figure.
Figure 4: Comparison across different embedding methods and different formats on the Diagnosis dataset. For simplicity, we used the Narrative format and max pooling for the remaining analyses after this section.
Figure 5: Confusion matrices for Mistral prediction on LOS 3 and Mort ICU tasks. Right: Mistral without QLoRA; left: Mistral after QLoRA.
...and 1 more figures
