Open (Clinical) LLMs are Sensitive to Instruction Phrasings

Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y. C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir

TL;DR

Open (Clinical) LLMs are Sensitive to Instruction Phrasings investigates how instruction-tuned LLMs' performance on clinical tasks varies with differently phrased prompts. The authors collect clinician-written prompts for ten classification and six information-extraction tasks from MIMIC-III and i2b2/n2c2 datasets and evaluate seven LLMs (general and clinical) in zero-shot, chunked-note settings, measuring AUROC for classification tasks and F1 for extraction tasks. They find substantial sensitivity to prompt phrasing across models and tasks, with fairness disparities across race and sex; surprisingly, general-domain models often outperform clinical-domain models, possibly due to the training data. The study emphasizes caution in deploying instruction-tuned LLMs in healthcare and contributes datasets, prompts, and code to spur further robustness research.

Abstract

Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.

Paper Structure (25 sections, 19 figures, 4 tables)

Figures (19)

  • Figure 1: How much does LLM performance on clinical tasks depend on the arbitrary phrasings of instructions? Here we show an illustrative example: the discrepancy in AUROC for Clinical Camel on the cohort selection (alcohol abuse) classification task when given the worst-performing (A) and the best-performing (B) prompts.
  • Figure 2: Variance in performance for clinical classification and information extraction tasks for each model. We show the distribution of deltas between the best- and worst-performing prompts for each task.
  • Figure 3: Variability in performance across prompts for the mortality prediction and drug extraction tasks. For most models, different but semantically equivalent prompts yield quite a range of performance.
  • Figure 4: Average AUROC across classification tasks given the best, median, and worst-performing prompts for each model.
  • Figure 5: Average F1 across extraction tasks given the best, median, and worst-performing prompts for each model.
  • ...and 14 more figures
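
The best, median, and worst-prompt summaries described in the figure captions can be illustrated with a minimal sketch. This is not the authors' released code: the function names (`auroc`, `prompt_sensitivity`), the prompt labels, and the toy labels and scores are all hypothetical, and AUROC is computed here with a simple pairwise comparison rather than any particular library.

```python
from statistics import median


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prompt_sensitivity(labels, scores_by_prompt):
    """Summarize how one model's AUROC on one task varies across prompts."""
    per_prompt = {name: auroc(labels, s) for name, s in scores_by_prompt.items()}
    values = list(per_prompt.values())
    return {
        "per_prompt": per_prompt,
        "best": max(values),
        "median": median(values),
        "worst": min(values),
        # Delta between the best- and worst-performing prompt for this task.
        "delta": max(values) - min(values),
    }


# Hypothetical toy data: gold labels for four notes and the scores a model
# assigns under three differently phrased (but equivalent) instructions.
labels = [1, 0, 1, 0]
scores_by_prompt = {
    "prompt_a": [0.9, 0.1, 0.8, 0.2],
    "prompt_b": [0.6, 0.7, 0.8, 0.2],
    "prompt_c": [0.3, 0.7, 0.4, 0.6],
}

summary = prompt_sensitivity(labels, scores_by_prompt)
print(summary)
```

Repeating this per task and per model would give one delta per (model, task) pair, which is the kind of quantity whose distribution Figure 2 describes; for extraction tasks, F1 would take the place of AUROC.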