LLM Sensitivity Evaluation Framework for Clinical Diagnosis

Chenwei Yan, Xiangling Fu, Yuxuan Xiong, Tianyi Wang, Siu Cheung Hui, Ji Wu, Xien Liu

TL;DR

This work introduces LLMSenEval, a framework to rigorously evaluate how large language models respond to perturbations of key medical information in clinical diagnosis. By constructing DiagnosisQA from MedQA and eight derived datasets that perturb gender, age, symptoms, and checkup results, the authors assess both model capability and sensitivity across five state-of-the-art LLMs, with GPT-4 consistently outperforming others but still showing limited reliability in real-world scenarios. The framework combines systematic perturbations with explicit prompt/instruction formats and robust metrics, revealing that current LLMs struggle to maintain stable, medically sensible reasoning when key information shifts. Findings highlight the need for models that reliably attend to critical clinical signals to earn trust for practical medical deployment, and the datasets and code are publicly available for reproducibility and future benchmarking.

Abstract

Large language models (LLMs) have demonstrated impressive performance across various domains. Clinical diagnosis, however, places higher demands on LLMs' reliability and sensitivity: a model should think like a physician and remain sensitive to key medical information that affects diagnostic reasoning, since subtle variations can lead to different diagnoses. Yet existing work focuses mainly on the sensitivity of LLMs to irrelevant context and overlooks the importance of key information. In this paper, we investigate the sensitivity of five LLMs, namely GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their sensitivity to key information, and using this information effectively. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
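
To make the idea of perturbing key medical information concrete, the following is a minimal sketch of a gender and an age perturbation applied to a question stem. It is not the authors' implementation: the swap table, the function names `perturb_gender` and `perturb_age`, and the example question are hypothetical, and the actual rules used to build the derived datasets may differ.

```python
import re

# Hypothetical swap table covering only unambiguous nouns/adjectives.
# Pronouns are deliberately left out because "her" can map to "his" or "him".
GENDER_SWAPS = {
    "man": "woman", "woman": "man",
    "male": "female", "female": "male",
    "boy": "girl", "girl": "boy",
}

# Whole-word, case-insensitive match on any key of the swap table.
_GENDER_PATTERN = re.compile(
    r"\b(" + "|".join(GENDER_SWAPS) + r")\b", re.IGNORECASE
)


def perturb_gender(question: str) -> str:
    """Swap gendered terms in a question stem, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = GENDER_SWAPS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return _GENDER_PATTERN.sub(swap, question)


def perturb_age(question: str, new_age: int) -> str:
    """Replace the first 'N-year-old' mention with a new age."""
    return re.sub(r"\b\d+-year-old\b", f"{new_age}-year-old", question, count=1)


if __name__ == "__main__":
    stem = "A 45-year-old man presents with chest pain."  # made-up example
    print(perturb_gender(stem))
    print(perturb_age(stem, 8))
```

A perturbation of this kind may or may not change the correct diagnosis, so each perturbed question would still need its reference answer checked before it is used for evaluation.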

Paper Structure

This paper contains 24 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Human doctors are sensitive to key medical information. How do LLMs perform when key information is perturbed?
  • Figure 2: The proposed framework LLMSenEval for LLM sensitivity evaluation.
  • Figure 3: Valid and invalid responses from LLMs.
  • Figure 4: The overall sensitivity performance of five LLMs. The bar chart shows the average difference in accuracy on the Same Answer Subset, and the line chart shows the total correct answers provided by the LLMs on the Different Answer Subset.
  • Figure 5: The impact of prompts on the performance of LLMs on the DiagnosisQA dataset.
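
The two quantities named in the Figure 4 caption can be sketched as follows. This is a hedged illustration only: the exact metric definitions are not given in this excerpt, so the use of an absolute accuracy difference, the function names, and the toy predictions are all assumptions.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answers."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def same_answer_accuracy_gap(
    original_preds: Sequence[str],
    perturbed_preds: Sequence[str],
    golds: Sequence[str],
) -> float:
    """Assumed metric for the Same Answer Subset: the perturbation leaves the
    reference answer unchanged, so we compare accuracy before and after."""
    return abs(accuracy(original_preds, golds) - accuracy(perturbed_preds, golds))


def different_answer_correct(
    perturbed_preds: Sequence[str], new_golds: Sequence[str]
) -> int:
    """Assumed metric for the Different Answer Subset: the perturbation changes
    the reference answer, so we count how often the model follows the change."""
    if len(perturbed_preds) != len(new_golds):
        raise ValueError("perturbed_preds and new_golds must be equal in length")
    return sum(p == g for p, g in zip(perturbed_preds, new_golds))


if __name__ == "__main__":
    golds = ["A", "B", "C", "D"]           # toy reference answers
    original = ["A", "B", "C", "A"]        # toy model outputs before perturbation
    perturbed = ["A", "B", "D", "A"]       # toy model outputs after perturbation
    print(same_answer_accuracy_gap(original, perturbed, golds))
    print(different_answer_correct(["B", "C"], ["B", "D"]))
```

Averaging `same_answer_accuracy_gap` over several perturbed datasets would give one bar per model, matching the "average difference" wording of the caption under these assumptions.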