LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Chenwei Yan, Xiangling Fu, Yuxuan Xiong, Tianyi Wang, Siu Cheung Hui, Ji Wu, Xien Liu
TL;DR
This work introduces LLMSenEval, a framework for rigorously evaluating how large language models respond to perturbations of key medical information in clinical diagnosis. The authors construct DiagnosisQA from MedQA, together with eight derived datasets that perturb gender, age, symptoms, and checkup results, and use them to assess both capability and sensitivity across five state-of-the-art LLMs. GPT-4 consistently outperforms the others but still shows limited reliability in real-world scenarios. The framework combines systematic perturbations with explicit prompt/instruction formats and robust metrics, revealing that current LLMs struggle to maintain stable, medically sensible reasoning when key information shifts. The findings highlight the need for models that reliably attend to critical clinical signals before they can be trusted in practical medical deployment. The datasets and code are publicly available for reproducibility and future benchmarking.
Abstract
Large language models (LLMs) have demonstrated impressive performance across various domains. Clinical diagnosis, however, places higher demands on their reliability and sensitivity: a model should reason like a physician and remain sensitive to the key medical information that drives diagnostic reasoning, since subtle variations can lead to different diagnoses. Yet existing work focuses mainly on the sensitivity of LLMs to irrelevant context and overlooks the importance of key information. In this paper, we investigate the sensitivity of five LLMs, namely GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. Future LLM development must focus on improving reliability, on increasing sensitivity to key information, and on using that information effectively. These improvements will strengthen human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
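To make the idea of perturbing key medical information concrete, the following is a minimal sketch in Python, not the authors' pipeline. The helper names (`perturb_age`, `perturb_gender`, `change_rate`), the regex-based rewriting, and the "fraction of changed answers" measure are all assumptions introduced here for illustration; the released repository should be consulted for the actual perturbation strategies and metrics.

```python
import re

# Hypothetical helpers for illustration only; not taken from the DiagnosisQA code.

def perturb_age(question: str, new_age: int) -> str:
    """Replace the first 'N-year-old' mention with new_age."""
    return re.sub(r"\b\d{1,3}-year-old\b", f"{new_age}-year-old", question, count=1)


# Simplified word swaps; 'her' -> 'his' ignores the object case ('him').
_GENDER_SWAPS = {
    "man": "woman", "woman": "man",
    "he": "she", "she": "he",
    "his": "her", "her": "his",
}
_GENDER_PATTERN = re.compile(
    r"\b(" + "|".join(_GENDER_SWAPS) + r")\b", re.IGNORECASE
)


def perturb_gender(question: str) -> str:
    """Swap simple gendered words, preserving initial capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = _GENDER_SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return _GENDER_PATTERN.sub(repl, question)


def change_rate(original_answers, perturbed_answers) -> float:
    """Fraction of questions whose predicted answer changed after perturbation."""
    if len(original_answers) != len(perturbed_answers):
        raise ValueError("answer lists must have the same length")
    if not original_answers:
        return 0.0
    changed = sum(a != b for a, b in zip(original_answers, perturbed_answers))
    return changed / len(original_answers)


if __name__ == "__main__":
    vignette = "A 45-year-old man presents with chest pain. He reports his symptoms began today."
    print(perturb_age(vignette, 70))
    print(perturb_gender(vignette))
    # Answers would come from querying an LLM; these are placeholder labels.
    print(change_rate(["A", "B", "C", "D"], ["A", "C", "C", "A"]))
```

In such a setup, each perturbed vignette would be sent to the model with the same prompt as the original, and the answers compared. Whether a changed answer is desirable depends on whether the perturbed attribute is clinically decisive for that question, which is the distinction the paper's evaluation targets.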
