Open (Clinical) LLMs are Sensitive to Instruction Phrasings

Alberto Mario Ceballos Arroyo; Monica Munnangi; Jiuding Sun; Karen Y. C. Zhang; Denis Jered McInerney; Byron C. Wallace; Silvio Amir

TL;DR

This paper investigates how the performance of instruction-tuned LLMs on clinical tasks varies with differently phrased prompts. The authors collect clinician-written prompts for ten classification and six information-extraction tasks drawn from the MIMIC-III and i2b2/n2c2 datasets, and evaluate seven LLMs (general-domain and clinical) in a zero-shot setting over chunked notes, measuring AUROC for classification and F1 for extraction. They find substantial sensitivity to prompt phrasing across models and tasks, along with fairness disparities across race and sex; surprisingly, general-domain models often outperform clinical-domain models, possibly due to the training data. The study urges caution in deploying instruction-tuned LLMs in healthcare and contributes datasets, prompts, and code to spur further robustness research.

Abstract

Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.
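The kind of sensitivity measurement described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`auroc`, `prompt_sensitivity`, `subgroup_gap`), the prompt names, the group labels, and the toy scores are all hypothetical, and the paper's actual evaluation pipeline may differ. The sketch computes AUROC per prompt, the gap between the best and worst prompt, and, for a single prompt, the largest AUROC difference between demographic groups.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive example is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def prompt_sensitivity(
    labels: Sequence[int], scores_by_prompt: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Best and worst AUROC over prompts, and the delta between them."""
    values = sorted(auroc(labels, s) for s in scores_by_prompt.values())
    worst, best = values[0], values[-1]
    return {"best": best, "worst": worst, "delta": best - worst}


def subgroup_gap(
    labels: Sequence[int], scores: Sequence[float], groups: Sequence[Hashable]
) -> float:
    """Largest AUROC difference between any two groups for one prompt."""
    by_group: Dict[Hashable, Tuple[List[int], List[float]]] = {}
    for y, s, g in zip(labels, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    values = [auroc(ys, ss) for ys, ss in by_group.values()]
    return max(values) - min(values)


# Toy data (hypothetical): 8 notes, binary labels, two demographic groups,
# and model scores obtained under two differently phrased prompts.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
groups = ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"]
scores_by_prompt = {
    "prompt_a": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4],
    "prompt_b": [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.7, 0.3],
}

summary = prompt_sensitivity(labels, scores_by_prompt)
gaps = {
    name: subgroup_gap(labels, scores, groups)
    for name, scores in scores_by_prompt.items()
}
print(summary)
print(gaps)
```

In this toy setup the two prompts differ only in the scores for the second group, so the second prompt lowers overall AUROC and also opens a gap between groups, which is the pattern the abstract warns about. The pairwise AUROC used here is quadratic in the number of examples; a rank-based or library implementation would be preferable at scale.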

Paper Structure (25 sections, 19 figures, 4 tables)

This paper contains 25 sections, 19 figures, 4 tables.

Introduction
Experimental Framework
Tasks and Datasets
MIMIC-III
n2c2 2018 Cohort Selection Challenge
i2b2 2008 Obesity Challenge
n2c2 2018 Adverse Drug Events and Medication Extraction in EHRs
i2b2 2014 Identifying Risk Factors for Heart Disease over Time
i2b2 2010 Relations Challenge
i2b2 2009 Medication Extraction Challenge
Instruction Collection
Models
Evaluation
Results
Fairness
...and 10 more sections

Figures (19)

Figure 1: How much does LLM performance on clinical tasks depend on the arbitrary phrasings of instructions? Here we show an illustrative example: the discrepancy in AUROC for Clinical Camel on the alcohol abuse classification task (from cohort selection) when given the worst-performing (A) and best-performing (B) prompts.
Figure 2: Variance in performance for clinical classification and information extraction tasks for each model. We show the distribution of deltas between the best and worst performing prompt for each task.
Figure 3: Variability in performance across prompts for the mortality prediction and drug extraction tasks. For most models, different but semantically equivalent prompts yield quite a range of performance.
Figure 4: Average AUROC across classification tasks given the best, median, and worst-performing prompts for each model.
Figure 5: Average F1 across extraction tasks given the best, median, and worst-performing prompts for each model.
...and 14 more figures
