Count-Based Approaches Remain Strong: A Benchmark Against Transformer and LLM Pipelines on Structured EHR

Jifan Gao, Michael Rosenthal, Brian Wolpin, Simona Cristea

TL;DR

Count-based approaches using ontology roll-ups remain competitive for structured EHR prediction, even against transformer and mixture-of-agents (MoA) pipelines. The study benchmarks three model families head-to-head on eight tasks from the EHRSHOT dataset under two label schemes. Across tasks, count-based models and the MoA pipeline split the wins roughly evenly, while CLMBR typically lags, highlighting the enduring value of simple, interpretable, and data-efficient approaches. The findings underscore that traditional tabular methods are strong baselines for structured EHR, while MoA and related LLM-based methods can provide task-specific gains and interpretability advantages.

Abstract

Structured electronic health records (EHR) are essential for clinical prediction. While count-based learners continue to perform strongly on such data, no benchmark has directly compared them against more recent mixture-of-agents LLM pipelines, which have been reported to outperform single LLMs in various NLP tasks. In this study, we evaluated three categories of methodologies for EHR prediction on eight outcomes from the EHRSHOT dataset: count-based models built from ontology roll-ups with two time bins, based on LightGBM and the tabular foundation model TabPFN; a pretrained sequential transformer (CLMBR); and a mixture-of-agents pipeline that converts tabular histories to natural-language summaries followed by a text classifier. Across the eight evaluation tasks, head-to-head wins were largely split between the count-based and the mixture-of-agents methods. Given their simplicity and interpretability, count-based models remain a strong candidate for structured EHR benchmarking. The source code is available at: https://github.com/cristea-lab/Structured_EHR_Benchmark.
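
To make the phrase "count-based models built from ontology roll-ups with two time bins" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a roll-up that truncates each code to its first three characters and a split into a "recent" bin (within one year before the prediction time) and an "older" bin. The function names, the truncation rule, and the example events are all hypothetical.

```python
from collections import Counter
from datetime import date


def roll_up(code: str, n_chars: int = 3) -> str:
    """Map a code to a coarser parent by truncation (assumed roll-up rule)."""
    return code.replace(".", "")[:n_chars]


def count_features(events, prediction_date, recent_days=365):
    """Count rolled-up codes per time bin.

    events: iterable of (event_date, code) pairs.
    Returns a Counter keyed by "<bin>|<rolled_code>".
    """
    feats = Counter()
    for event_date, code in events:
        age_days = (prediction_date - event_date).days
        if age_days < 0:
            continue  # skip events dated after the prediction time
        bin_name = "recent" if age_days <= recent_days else "older"
        feats[f"{bin_name}|{roll_up(code)}"] += 1
    return feats


# Hypothetical patient history (ICD-10-style codes, invented for illustration).
events = [
    (date(2023, 11, 2), "K83.1"),
    (date(2023, 12, 15), "K83.0"),
    (date(2021, 5, 20), "E11.9"),
    (date(2024, 3, 1), "I10"),  # dated after the prediction time, so skipped
]
features = count_features(events, prediction_date=date(2024, 1, 1))
print(dict(features))
```

Each key in the resulting mapping would become one column of the tabular input to a learner such as LightGBM or TabPFN. The actual roll-up levels and bin boundaries used in the paper may differ from the ones assumed here.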

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Study design and benchmarking pipeline. We evaluated three modeling categories on the EHRSHOT benchmark dataset, with OMOP-standardized cohorts and eight tasks: Long Length of Stay (LOS), ICU Transfer, Readmission, Pancreatic Cancer, Hypertension, Acute Myocardial Infarction (MI), and Hyperlipidemia. From the OMOP tables, we construct: (i) a count-based tabular pipeline that aggregates ICD-10 and ATC codes over a 1-year look-back window and trains strong tabular models (LightGBM and TabPFN); (ii) a pretrained sequential model (CLMBR) that tokenizes time-stamped OMOP concepts and learns vector representations for downstream prediction, introduced with the EHRSHOT dataset; and (iii) a MoA LLM pipeline that converts each patient’s longitudinal record into a concise natural-language summary before classification. All models use the same cohorts, prediction windows, and splits; performance is compared on held-out data.
  • Figure 2: Performance of the benchmarked models across the eight prediction tasks in EHRSHOT under two label definitions (earliest and latest). Top panels show AUROC; bottom panels show AUPR. Bolded bars indicate the best-performing model for each task; the white dotted line in each AUPR panel marks the outcome prevalence. Across tasks, wins are shared mainly between the count-based methods and the MoA pipeline, with count-based methods holding a slight overall edge and generally outperforming CLMBR.
  • Figure 3: Task-adaptive emphasis in MoA intermediate summaries. Bars show the change in ICD-10 chapter mention percentages between the MoA summary and the original structured EHR ($\Delta\% = \%_{\text{summary}} - \%_{\text{original}}$, where each term is the percentage of mentions falling in that chapter), computed with MedCAT concept extraction. Positive values indicate chapters the summary amplifies; negative values indicate chapters it down-weights. Chapters whose sign differs between tasks illustrate adaptation to the prediction task. For example, Endocrine/Metabolic is amplified for pancreatic cancer but reduced for ICU transfer, whereas Nervous system is boosted for ICU transfer and de-emphasized for pancreatic cancer.
  • Figure 4: Case study: LLM-generated intermediate summary from structured EHR in the MoA pipeline. Structured EHR is first converted into an "EVENT at AGE" text sequence and then passed to Qwen with a task prompt to predict the risk of a new pancreatic cancer diagnosis in the next year. The model is instructed to return a constrained JSON object containing risk_category (Low/Moderate/High), risk_score in $[0,1]$, drivers_positive/drivers_negative drawn from the input, a two-to-four-sentence justification that references evidence and timing, and an insufficient_evidence flag. In this example, Qwen highlights biliary obstruction, cholangitis, and abdominal pain, and returns a Moderate risk with score 0.5, while also yielding an interpretable intermediate summary used for downstream text classifiers in the MoA pipeline.
  • Figure 5: SHAP analysis of LightGBM models for four prediction tasks. For each task, we show the mean absolute SHAP values on the test set for the top five most influential features. All of the contributing factors come from the most recent time bin (within one year before the prediction time). This detail is not explicitly labeled in the figure, as the concept names are already lengthy.
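
The $\Delta\%$ quantity in the Figure 3 caption can be illustrated with a short sketch. It assumes concept extraction has already mapped every mention to an ICD-10 chapter, so the MedCAT step is not reproduced here. The function names and the mention lists are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def mention_percentages(chapters):
    """Percentage of mentions falling in each chapter."""
    counts = Counter(chapters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {chapter: 100.0 * n / total for chapter, n in counts.items()}


def delta_percent(summary_chapters, original_chapters):
    """Delta% per chapter: share in the summary minus share in the original."""
    summary = mention_percentages(summary_chapters)
    original = mention_percentages(original_chapters)
    chapters = sorted(set(summary) | set(original))
    return {c: summary.get(c, 0.0) - original.get(c, 0.0) for c in chapters}


# Hypothetical chapter mentions, invented for illustration.
original = ["Endocrine/Metabolic"] * 2 + ["Nervous system"] * 2 + ["Digestive"] * 4
summary = ["Endocrine/Metabolic"] * 2 + ["Digestive"] * 2

delta = delta_percent(summary, original)
for chapter, value in delta.items():
    print(f"{chapter}: {value:+.1f}")
```

In this toy case Endocrine/Metabolic rises from 25% of mentions to 50% (a positive value, so amplified), Nervous system falls from 25% to 0% (negative, so down-weighted), and Digestive stays at 50%.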