Syn-STARTS: Synthesized START Triage Scenario Generation Framework for Scalable LLM Evaluation
Chiharu Hagiwara, Naoki Nonaka, Yuhta Hashimoto, Ryu Uchimido, Jun Seita
TL;DR
This work introduces Syn-STARTS, a framework that uses a generation-then-validation pipeline to synthesize START-based triage cases for scalable LLM evaluation in high-stakes medical scenarios. By constructing a large, balanced corpus and validating each case for START consistency, medical plausibility, and narrative coherence, the authors demonstrate that synthetic data can closely match expert-authored benchmarks in perceptual realism and fidelity (Pearson correlation $r = 0.92$, $p < 0.01$) while enabling extensive diagnostic analyses. The study shows that dataset composition and scale meaningfully influence evaluation outcomes: larger synthetic datasets reduce performance variance and expose model-specific error patterns. Overall, Syn-STARTS offers a scalable, privacy-preserving benchmark that supports thorough, multi-faceted evaluation of LLMs in triage and potentially in other clinical decision tasks. The authors outline future work on dynamic MCIs, fairness, and narrative diversification.
Abstract
Triage is a critical decision-making process in mass casualty incidents (MCIs), where the goal is to maximize victim survival. AI is attracting attention as a way to support optimal decisions under limited resources and time, but developing and evaluating such systems requires benchmark datasets of sufficient quantity and quality. MCIs occur infrequently, and detailed records are difficult to accumulate at the scene, which makes it challenging to collect large-scale real-world data for research use. We therefore developed Syn-STARTS, a framework that uses LLMs to generate triage cases, and verified its effectiveness. The triage cases generated by Syn-STARTS were qualitatively indistinguishable from those in the TRIAGE open dataset, which was manually curated from training materials. Furthermore, when LLM accuracy was evaluated using hundreds of cases from each of the green, yellow, red, and black categories defined by the standard triage method START, the results were highly stable. These findings strongly indicate the potential of synthetic data for developing high-performance AI models for severe and critical medical situations.
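For readers unfamiliar with START, the following minimal sketch shows the commonly taught adult START decision rule that maps a patient's findings to the green, yellow, red, and black categories mentioned above. It is an illustration only and not the authors' implementation: the `Patient` fields, the `start_category` function, and the `is_start_consistent` helper are hypothetical names introduced here, and the thresholds (respiratory rate above 30 per minute, capillary refill above 2 seconds) are the standard textbook values rather than values stated in this paper.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical structured findings for one synthetic casualty."""
    can_walk: bool
    breathing: bool
    breathing_after_airway_opened: bool  # only relevant if not breathing
    respiratory_rate: int                # breaths per minute
    radial_pulse: bool
    capillary_refill_s: float            # seconds
    obeys_commands: bool


def start_category(p: Patient) -> str:
    """Apply the textbook adult START steps in order."""
    # Step 1: walking wounded are minor.
    if p.can_walk:
        return "green"
    # Step 2: respiration. Reposition the airway if the patient is not breathing.
    if not p.breathing:
        return "red" if p.breathing_after_airway_opened else "black"
    if p.respiratory_rate > 30:
        return "red"
    # Step 3: perfusion.
    if not p.radial_pulse or p.capillary_refill_s > 2.0:
        return "red"
    # Step 4: mental status.
    if not p.obeys_commands:
        return "red"
    # Non-ambulatory but passes all checks: delayed.
    return "yellow"


def is_start_consistent(p: Patient, label: str) -> bool:
    """Assumed form of a consistency check: does the label match the rule?"""
    return start_category(p) == label
```

A check of this kind could, under these assumptions, be used to reject generated cases whose assigned category disagrees with their stated findings, for example `is_start_consistent(patient, "yellow")`.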
