Open (Clinical) LLMs are Sensitive to Instruction Phrasings
Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y. C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir
TL;DR
This paper investigates how the performance of instruction-tuned LLMs on clinical tasks varies with differently phrased prompts. The authors collect clinician-written prompts for ten classification and six information-extraction tasks built on the MIMIC-III and i2b2/n2c2 datasets, and evaluate seven LLMs (general and clinical) in a zero-shot setting over chunked notes, measuring AUROC or F1 as appropriate to the task. They find substantial sensitivity to prompt phrasing across models and tasks, and show that phrasing also affects fairness, with performance gaps across race and sex. Surprisingly, general-domain models often outperform clinical-domain models, possibly because of their training data. The study urges caution in deploying instruction-tuned LLMs in healthcare and contributes datasets, prompts, and code to spur further robustness research.
Abstract
Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle compared to their general-domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range of results both in overall performance and in differences between demographic groups.
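To make the notion of prompt sensitivity concrete, here is a minimal sketch of one way it could be quantified for a binary classification task: compute AUROC separately for each prompt phrasing, report the spread between the best and worst phrasing, and compare AUROC between two demographic groups under each phrasing. This is an illustration under stated assumptions, not the authors' evaluation code. The function names, prompt names, group labels, and toy scores are all hypothetical, and using an AUROC difference between groups as the fairness measure is an assumption made here for simplicity.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def prompt_sensitivity(labels, scores_by_prompt):
    """Summarize how AUROC varies across prompt phrasings (hypothetical helper)."""
    per_prompt = {name: auroc(labels, s) for name, s in scores_by_prompt.items()}
    values = list(per_prompt.values())
    return {
        "per_prompt": per_prompt,
        "mean": mean(values),
        "worst": min(values),
        "best": max(values),
        "spread": max(values) - min(values),
    }


def subgroup_gap(labels, scores, groups, group_a, group_b):
    """AUROC of group_a minus AUROC of group_b (one possible fairness gap)."""
    def group_auroc(g):
        idx = [i for i, x in enumerate(groups) if x == g]
        return auroc([labels[i] for i in idx], [scores[i] for i in idx])

    return group_auroc(group_a) - group_auroc(group_b)


# Toy data (made up): six notes, binary outcome, two demographic groups.
labels = [1, 0, 1, 0, 1, 0]
groups = ["a", "a", "b", "b", "a", "b"]

# Model scores for the same notes under two hypothetical prompt phrasings.
scores_by_prompt = {
    "prompt_a": [0.9, 0.2, 0.8, 0.3, 0.7, 0.1],
    "prompt_b": [0.6, 0.7, 0.8, 0.3, 0.4, 0.5],
}

summary = prompt_sensitivity(labels, scores_by_prompt)
gaps = {
    name: subgroup_gap(labels, s, groups, "a", "b")
    for name, s in scores_by_prompt.items()
}

print(summary)
print(gaps)
```

In this toy setup the two phrasings give different overall AUROC values and different between-group gaps on identical notes, which is the kind of variation the abstract describes. A real evaluation would also need to aggregate predictions over chunked notes and use far more examples per group than shown here.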
