RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Aswini Sivakumar; Vijayan Sugumaran; Yao Qiang

TL;DR

RAG-X is a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. It introduces Context Utilization Efficiency (CUE) metrics that disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy.

Abstract

Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches cannot diagnose whether an error stems from faulty retrieval or flawed generation, which prevents developers from making targeted improvements. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an "Accuracy Fallacy", where a 14% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
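
To make the quadrant idea concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not define the CUE metrics or name the four quadrants, so everything here is an assumption for illustration: the two boolean signals per example (`evidence_retrieved`, `answer_correct`), the quadrant labels, and the helper names `quadrant`, `quadrant_rates`, and `accuracy_gap`. It only shows how crossing a retrieval outcome with a generation outcome separates correct answers backed by retrieved evidence from correct answers that are not.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical quadrant labels; the paper's own names may differ.
QUADRANTS = (
    "grounded_correct",    # evidence retrieved, answer correct
    "ungrounded_correct",  # evidence missing, answer still correct
    "generation_failure",  # evidence retrieved, answer wrong
    "retrieval_failure",   # evidence missing, answer wrong
)


@dataclass(frozen=True)
class Example:
    """One evaluated question (assumed representation)."""
    evidence_retrieved: bool  # did the retriever return supporting evidence?
    answer_correct: bool      # was the generated answer judged correct?


def quadrant(ex: Example) -> str:
    """Map an example onto one of the four assumed quadrants."""
    if ex.answer_correct:
        return "grounded_correct" if ex.evidence_retrieved else "ungrounded_correct"
    return "generation_failure" if ex.evidence_retrieved else "retrieval_failure"


def quadrant_rates(examples: list[Example]) -> dict[str, float]:
    """Return the fraction of examples falling in each quadrant."""
    if not examples:
        raise ValueError("need at least one example")
    counts = Counter(quadrant(ex) for ex in examples)
    return {name: counts[name] / len(examples) for name in QUADRANTS}


def accuracy_gap(rates: dict[str, float]) -> float:
    """Overall accuracy minus the share of answers that are correct and grounded."""
    accuracy = rates["grounded_correct"] + rates["ungrounded_correct"]
    return accuracy - rates["grounded_correct"]


# Toy data invented for this sketch; these are not results from the paper.
toy_examples = (
    [Example(True, True)] * 5
    + [Example(False, True)] * 2
    + [Example(True, False)] * 2
    + [Example(False, False)] * 1
)
toy_rates = quadrant_rates(toy_examples)
toy_gap = accuracy_gap(toy_rates)
```

Under these assumptions, the gap is simply the share of answers that are correct without supporting evidence, which is one plausible reading of the difference between perceived success and evidence-based grounding described in the abstract.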

Paper Structure (13 sections, 1 equation, 1 figure, 4 tables)

Introduction
Related Work
Application of RAG in Medical Domain
Evaluation of RAG in Medical Domain
Method
RAG Pipeline and Medical Normalization
RAG-X Approach
Experiment Settings
Results and Discussions
Comparison of Backbone LLMs
Comparison of Retrieval Models
Case Study: Diagnosing RAG Pipeline with RAG-X
Conclusion

Figures (1)

Figure 1: Overview of RAG System and RAG-X Diagnostics. This figure illustrates the workflow of an RAG system, where external knowledge is indexed into a vector database. A user query interacts with the retriever to fetch relevant information, which is then passed to a generator to produce a response. The RAG-X framework adds diagnostic modules for both retrieval (e.g., ranking, semantic relevance, fine-grained evaluation) and generation (e.g., surface-level similarity, structured output, semantic similarity, and LLM-based judgment) to provide detailed performance analysis.
