LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation

Joshua Sebastian, Karma Tobden, KMA Solaiman

TL;DR

This work tackles the lack of open benchmarks for emergency triage by creating an open, LLM-assisted deterioration-prediction benchmark derived from MIMIC-IV-ED, with two regimes that mirror hospital-rich and field-like (MCI) contexts. It combines deterministic data curation, feature harmonization (including AVPU and oxygen logic) guided by large language models, and baseline predictive models evaluated with SHAP interpretability to reveal the driving factors for early deterioration. The results show that simple vitals and triage observations carry substantial predictive signal, with labs offering incremental gains primarily in richer hospital contexts, and that ensemble methods generally outperform linear models. By releasing preprocessing code, feature mappings, and data splits, the paper advances reproducibility and accessibility in clinical AI triage research, potentially accelerating benchmarking and integration of triage tools in both hospital and field scenarios.

Abstract

Research on emergency and mass casualty incident (MCI) triage has been limited by the absence of openly usable, reproducible benchmarks. Yet these scenarios demand rapid identification of the patients most in need, where accurate deterioration prediction can guide timely interventions. While the MIMIC-IV-ED database is openly available to credentialed researchers, transforming it into a triage-focused benchmark requires extensive preprocessing, feature harmonization, and schema alignment -- barriers that restrict access to highly technical users. We address these gaps by introducing an open, LLM-assisted emergency triage benchmark for deterioration prediction, where deterioration is defined as ICU transfer or in-hospital mortality. The benchmark defines two regimes: (i) a hospital-rich setting with vitals, labs, notes, chief complaints, and structured observations, and (ii) an MCI-like field simulation limited to vitals, observations, and notes. Large language models (LLMs) contributed directly to dataset construction by (i) harmonizing noisy fields such as AVPU and breathing devices, (ii) prioritizing clinically relevant vitals and labs, and (iii) guiding schema alignment and efficient merging of disparate tables. We further provide baseline models and SHAP-based interpretability analyses, illustrating predictive gaps between regimes and the features most critical for triage. Together, these contributions make triage prediction research more reproducible and accessible -- a step toward dataset democratization in clinical AI.
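
The two regimes described in the abstract can be read as two selections over the same feature groups. The sketch below is only an illustration of that idea, not the released preprocessing code: the group names follow the abstract, but every column name (heart_rate, lactate, and so on) and the helper regime_columns are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical column names per feature group. The abstract names the
# groups only; the actual benchmark columns may differ.
FEATURE_GROUPS: Dict[str, List[str]] = {
    "vitals": ["heart_rate", "resp_rate", "spo2", "sbp", "temperature"],
    "observations": ["avpu", "on_oxygen"],
    "notes": ["triage_note"],
    "chief_complaints": ["chief_complaint"],
    "labs": ["lactate", "wbc"],
}

# Regime definitions taken from the abstract: the MCI-like regime drops
# labs and chief complaints and keeps vitals, observations, and notes.
REGIMES: Dict[str, List[str]] = {
    "hospital_rich": ["vitals", "labs", "notes", "chief_complaints", "observations"],
    "mci_like": ["vitals", "observations", "notes"],
}


def regime_columns(regime: str) -> List[str]:
    """Return the flat list of columns available under a regime."""
    columns: List[str] = []
    for group in REGIMES[regime]:
        columns.extend(FEATURE_GROUPS[group])
    return columns
```

Under these assumptions, regime_columns("mci_like") is a strict subset of regime_columns("hospital_rich"), which is what allows the same model family to be compared across regimes.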

Paper Structure

This paper contains 30 sections, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Section 3 pipeline with LLM-assisted harmonization annotated at the steps it influenced.
  • Figure 2: Comparison of model performance: (a) feature groups individually, (b) cumulative feature sets.
  • Figure 3: Global SHAP summaries for Random Forest across three regimes. (a) Hospital-rich features; (b) Cumulative feature addition (Vitals → Vitals+Obs → +Labs); (c) MCI-like reduced features. Each point represents a patient, colored by feature value (red = high, blue = low).
  • Figure 4: LLM-functionality focused curation pipeline. LLMs guided join keys/deduplication, AVPU mapping, respiratory harmonization (binary + multi-class), complaint parsing with synonyms/negation, readmission-aware first-vitals, and lightweight noise filters; models are evaluated under hospital-rich and MCI-like regimes with SHAP-based explanations.
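
The AVPU mapping and binary respiratory harmonization named in the Figure 4 caption can be pictured as small deterministic lookups. The following is a minimal sketch under stated assumptions, not the authors' mapping: the keyword lists, the room-air tokens, and the function names harmonize_avpu and on_supplemental_oxygen are all hypothetical, and the rule that any named device other than room air counts as supplemental oxygen is a simplifying assumption.

```python
from typing import Optional

# Hypothetical keyword rules, ordered from most to least severe so that
# "unresponsive to pain" maps to U rather than P.
_AVPU_RULES = [
    ("U", ("unresponsive", "unresp")),
    ("P", ("pain",)),
    ("V", ("verbal", "voice")),
    ("A", ("alert", "awake")),
]

# Hypothetical tokens meaning "no supplemental oxygen".
_ROOM_AIR = {"ra", "room air", "none"}


def harmonize_avpu(raw: Optional[str]) -> Optional[str]:
    """Map a free-text responsiveness entry to A, V, P, or U.

    Returns None for missing or unrecognized entries.
    """
    if raw is None:
        return None
    text = raw.strip().lower()
    if not text:
        return None
    if text in {"a", "v", "p", "u"}:
        return text.upper()
    for label, keywords in _AVPU_RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return None


def on_supplemental_oxygen(device: Optional[str]) -> Optional[bool]:
    """Binary respiratory flag: False for room air, True for any other
    named device (an assumption), None when the field is missing."""
    if device is None:
        return None
    text = device.strip().lower()
    if not text:
        return None
    return text not in _ROOM_AIR
```

Keeping the mapping as explicit tables means an LLM-suggested synonym can be reviewed and added as data, while the transformation applied to the dataset stays deterministic.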