CARE-RAG - Clinical Assessment and Reasoning in RAG

Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding

TL;DR

This work addresses the problem that retrieving evidence does not guarantee correct clinical reasoning by LLMs. It proposes CARE-RAG, a benchmark that systematically manipulates context quality and reasoning demand in a clinical guideline QA setting built on Written Exposure Therapy (WET). The study evaluates 20 LLMs across context fidelity, reasoning complexity, and question type, using clinician-validated gold answers and reasoning traces. Many models achieve high multiple-choice (MCQ) accuracy, but they ground their reasoning in the retrieved evidence inconsistently. This underscores the need for guardrails, careful prompt design, and exploration of advanced RAG architectures before safe clinical use.

Abstract

Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.

Paper Structure

This paper contains 16 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1:
  • Figure 2: Accuracy across entailment score bins for three models (Qwen-QwQ-32B, Llama-3.1-8B-Instruct, BioMistral-7B), separated by MCQ and Yes/No questions; higher entailment generally improves MCQ accuracy, while Yes/No remains less consistent.
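
The kind of breakdown described for Figure 2 can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes each response carries an entailment score in [0, 1] and a correctness flag, and the function name `accuracy_by_entailment_bin`, the equal-width binning, the bin count, and the sample records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_entailment_bin(records, n_bins=4):
    """Compute accuracy per (question_type, bin), using equal-width bins.

    Assumption: each record is (question_type, entailment_score, is_correct)
    with entailment_score in [0, 1]. The paper's actual binning is not
    specified here; equal-width bins are chosen only for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for question_type, entailment, correct in records:
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        bin_index = min(int(entailment * n_bins), n_bins - 1)
        key = (question_type, bin_index)
        totals[key] += 1
        hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not results from the paper.
records = [
    ("MCQ", 0.10, False),
    ("MCQ", 0.90, True),
    ("MCQ", 0.95, True),
    ("YesNo", 0.20, True),
    ("YesNo", 0.85, False),
    ("YesNo", 0.80, True),
]
result = accuracy_by_entailment_bin(records)
```

Keeping question type in the grouping key gives MCQ and Yes/No separate accuracy curves, so a trend in one cannot mask the absence of a trend in the other.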