A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen

Abstract

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations, published in the last decade and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing a gap between knowing the guideline contents and knowing where they come from. Adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark to systematically reveal which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.

Paper Structure

This paper contains 39 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of CPGBench for evaluating LLMs' detection of and adherence to CPGs. a Geographical distribution of our collected clinical practice guidelines across the world. b Our benchmark covers all 24 medical specialties defined by the American Board of Medical Specialties. c We leverage the original guideline information to synthesize conversations and convert these conversations into suitable forms for the detection and adherence capability evaluations. d Recommendation distribution over 10 years across different regions. e An overview of the evaluated models and their content detection, title grounding, and adherence rates.
  • Figure 2: Specialty statistics of each country/region/international organization in our collected database. Percentage numbers of specialties less than $1\%$ are not displayed for better readability.
  • Figure 3: a The content detection rates of all models in all regions/international organizations are above 50%. b Title grounding rates in the detection task. GPT5 leads in this sub-task in most regions except the Chinese Mainland. c The adherence rates of all models across all regions or international organizations are consistently lower than their corresponding content detection rates. All plots share the same legend.
  • Figure 4: Content detection and adherence rates across medical specialties. The average performance across models in each specialty is shown in the upper subplot. The average performance across specialties in each model is shown in the right subplot. The subplot layout is consistent across all plots. a The content detection rates are generally high in all medical specialties. b The adherence rates of different models are generally higher in specialties such as allergy and immunology, pediatrics, preventive medicine, and family medicine compared to other specialties. c The content detection rates in the safety-critical subset are higher than those of the full set. d Adherence rates in the safety-critical subset are mostly lower than those of the full set shown in b across specialties.
  • Figure 5: Capability difference of models in detection and adherence measured by detection rates minus adherence rates across specialties.
  • ...and 2 more figures