Syn-STARTS: Synthesized START Triage Scenario Generation Framework for Scalable LLM Evaluation

Chiharu Hagiwara, Naoki Nonaka, Yuhta Hashimoto, Ryu Uchimido, Jun Seita

TL;DR

This work introduces Syn-STARTS, a framework that uses a generation-then-validation pipeline to synthesize START-based triage cases for scalable LLM evaluation in high-stakes medical scenarios. By constructing a large, balanced corpus and validating cases for START consistency, medical plausibility, and narrative coherence, the authors demonstrate that synthetic data can closely match expert-authored benchmarks in perceptual realism and fidelity (Pearson correlation $r = 0.92$, $p < 0.01$) while enabling extensive diagnostic analyses. The study reveals that dataset composition and scale meaningfully influence evaluation outcomes, with larger synthetic datasets reducing performance variance and exposing model-specific error patterns. Overall, Syn-STARTS offers a scalable, privacy-preserving benchmark that supports thorough, multi-faceted evaluation of LLMs in triage and potentially other clinical decision tasks, and the authors outline future work on dynamic MCIs, fairness, and narrative diversification.

Abstract

Triage is a critical decision-making process in mass casualty incidents (MCIs), aimed at maximizing victim survival rates. While AI is gaining attention as a way to make optimal decisions under limited resources and time, its development and performance evaluation require benchmark datasets of sufficient quantity and quality. However, MCIs occur infrequently, and sufficient records are difficult to accumulate at the scene, making it challenging to collect large-scale real-world data for research use. Therefore, we developed Syn-STARTS, a framework that uses LLMs to generate triage cases, and verified its effectiveness. The results showed that the triage cases generated by Syn-STARTS were qualitatively indistinguishable from those in the TRIAGE open dataset, which was manually curated from training materials. Furthermore, when LLM accuracy was evaluated using hundreds of cases from each of the green, yellow, red, and black categories defined by the standard triage method START, the results were highly stable. This strongly suggests the potential of synthetic data for developing high-performance AI models for severe and critical medical situations.

Paper Structure

This paper contains 30 sections, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: The structure of a Syn-STARTS case, comprising the tag, structured vitals, and narrative description. The vitals provide a verifiable intermediate layer that links the description to the triage tag via algorithmic confirmation.
  • Figure 2: The experimental datasets include the TRIAGE-adult dataset and the Syn-STARTS datasets. The Syn-STARTS variants are configured by tag distribution and scale, and each configuration comprises ten non-overlapping replicate datasets drawn without replacement. Hereafter, triage tag counts are presented in the fixed order of {Green, Yellow, Red, Black}.
  • Figure 3: Expert discrimination of synthetic Syn-STARTS cases from expert-authored vignettes. (a): Number of questions answered correctly by each expert; (b): Averaged confusion matrix from the three experts, indicating that Syn-STARTS cases are difficult to distinguish from expert-authored cases.
  • Figure 4: Scatter plot showing the relationship between model accuracy on the TRIAGE-adult dataset and on the Syn-STARTS dataset with matched tag distribution ($n = 54$; $\{18, 11, 22, 3\}$). The strong linear relationship (Pearson’s $r = 0.92$) suggests that performance is largely preserved across the two datasets.
  • Figure 5: Accuracy distributions across Syn-STARTS datasets with "TRIAGE adult" tag distributions ($n=54$; $\{18, 11, 22, 3\}$) and uniform tag distributions ($n=56$; $\{14, 14, 14, 14\}$). The results reveal a model-dependent sensitivity to the dataset's composition. * denotes a statistically significant difference between the two distributions (Wilcoxon signed-rank test; GPT-3.5, $p < 0.01$; GPT-4, $p = 0.04$).
  • ...and 8 more figures
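
Figure 1 describes structured vitals as a verifiable layer that ties each narrative to its triage tag through algorithmic confirmation. As a rough illustration of what such a check could look like, the sketch below applies the standard adult START decision rules to a set of vitals and compares the result with a claimed tag. The `Vitals` fields and the function names `start_tag` and `is_consistent` are hypothetical and are not taken from the paper; the authors' actual schema and validation procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical vitals record; field names are illustrative assumptions."""
    can_walk: bool
    breathing: bool
    breathing_after_airway_opened: bool  # only relevant when breathing is False
    respiratory_rate: int                # breaths per minute
    radial_pulse_present: bool
    capillary_refill_seconds: float
    follows_commands: bool


def start_tag(v: Vitals) -> str:
    """Apply the standard adult START rules and return a tag name."""
    # Walking wounded are tagged minor.
    if v.can_walk:
        return "Green"
    # Not breathing: reposition the airway; still apneic means deceased/expectant.
    if not v.breathing:
        return "Red" if v.breathing_after_airway_opened else "Black"
    # Respirations above 30 per minute are immediate.
    if v.respiratory_rate > 30:
        return "Red"
    # Perfusion: absent radial pulse or capillary refill over 2 seconds.
    if not v.radial_pulse_present or v.capillary_refill_seconds > 2.0:
        return "Red"
    # Mental status: unable to follow simple commands.
    if not v.follows_commands:
        return "Red"
    # Everything else is delayed.
    return "Yellow"


def is_consistent(claimed_tag: str, v: Vitals) -> bool:
    """True when the claimed tag matches the tag derived from the vitals."""
    return start_tag(v) == claimed_tag


if __name__ == "__main__":
    case = Vitals(
        can_walk=False,
        breathing=True,
        breathing_after_airway_opened=True,
        respiratory_rate=34,
        radial_pulse_present=True,
        capillary_refill_seconds=1.5,
        follows_commands=True,
    )
    print(start_tag(case), is_consistent("Yellow", case))
```

A generated case whose claimed tag disagrees with the derived tag (as in the usage example, where a respiratory rate of 34 forces Red) would be flagged by this kind of check before any plausibility or coherence review.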