TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations

Nathalie Maria Kirch, Konstantin Hebenstreit, Matthias Samwald

Abstract

We present the TRIAGE Benchmark, a novel machine ethics (ME) benchmark that tests LLMs' ability to make ethical decisions during mass casualty incidents. It uses real-world ethical dilemmas with clear solutions designed by medical professionals, offering a more realistic alternative to annotation-based benchmarks. TRIAGE incorporates various prompting styles to evaluate model performance across different contexts. Most models consistently outperformed random guessing, suggesting LLMs may support decision-making in triage scenarios. Neutral or factual scenario formulations led to the best performance, unlike other ME benchmarks where ethical reminders improved outcomes. Adversarial prompts reduced performance but not to random guessing levels. Open-source models made more morally serious errors, and general capability overall predicted better performance.

Paper Structure

This paper contains 24 sections, 1 equation, 14 figures, 11 tables.

Figures (14)

  • Figure 1: START triage categories with example patient descriptions. Categories are ordered from least (green) to most (red) resource intensive.
  • Figure 2: (a) Proportions of Correct Answers of Models on the TRIAGE Dataset, and (b) Best and Worst-Case Performance of Models, showing how the ordering of models changes depending on the scenario.
  • Figure 3: Pairwise Mixed Model Comparisons. Red stars indicate significance level (* : p < 0.05, ** : p < 0.01, *** : p < 0.001). The red dashed line indicates random guessing (25%). The blue dashed line indicates the value of the intercept. The intercept is the estimated outcome when all predictor variables are at their reference levels (no ethics prompt, neutral syntax, and weaker model category). This value is the baseline outcome before accounting for the influence of the predictors and the random effects. The significance of all other factors is assessed relative to the intercept. Estimates in mixed effects models are typically in logits; we converted them to proportions in this figure.
  • Figure 4: Comparison of average error patterns. Overcaring errors are more numerous than undercaring errors except for Mistral and Mixtral.
  • Figure 5: Example question with neutral syntax (see Fig. \ref{fig:syntax_variations_neutral}) and no ethics prompt. Full prompts can be found in Appendix \ref{app:example-dialogues}. GPT-3.5-Turbo answers correctly (correct answer: Red/Immediate).
  • ...and 9 more figures
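
The Figure 3 caption notes that mixed-model estimates are in logits and were converted to proportions. A minimal sketch of that conversion, assuming the standard inverse-logit (logistic) transform, is given below. The function names and the numeric intercept and effect values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def logit_to_proportion(logit: float) -> float:
    """Inverse-logit transform: map a log-odds value to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def proportion_to_logit(p: float) -> float:
    """Logit transform: map a proportion in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


# The 25% random-guessing line from the caption, expressed on the logit scale.
chance_logit = proportion_to_logit(0.25)

# Hypothetical values, for illustration only (not taken from the paper).
intercept_logit = 0.5   # baseline with all predictors at reference levels
effect_logit = -0.3     # shift contributed by one predictor

# Effects add on the logit scale; convert to a proportion only at the end.
baseline = logit_to_proportion(intercept_logit)
with_effect = logit_to_proportion(intercept_logit + effect_logit)

print(f"chance (logit):  {chance_logit:.3f}")
print(f"baseline:        {baseline:.3f}")
print(f"with predictor:  {with_effect:.3f}")
```

Because effects are additive only on the logit scale, the sketch sums the intercept and the predictor effect before transforming, rather than transforming each term separately.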