M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin

TL;DR

The paper addresses factual errors in medical RAG systems by introducing M-Eval, a back-end verification framework inspired by heterogeneity analysis. It splits RAG outputs into responses and evidence, extracts claims from the response, retrieves additional evidence from PubMed with BM25, and applies a heterogeneity analysis grounded in the DerSimonian–Laird random-effects model to assess consistency and reliability across multiple sources. By computing a combined evidence-based score and drawing on the highest-quality articles, M-Eval improves factual accuracy across diverse LLMs, with reported gains of up to 23.31%; ablations support the contribution of reliability scoring, heterogeneity analysis, and extra evidence. The approach is intended to improve reliability and reduce diagnostic risk in medical AI applications; the authors acknowledge practical limitations in access to meta-analytic data and the approximate nature of the heterogeneity step.
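For readers unfamiliar with the DerSimonian–Laird random-effects model named above, the following is a minimal sketch of the standard estimator as used in conventional meta-analysis. It is not the authors' implementation: how M-Eval maps LLM stance judgments and reliability scores onto effect sizes and variances is not described in this summary, so the inputs `effects` and `variances` and the function name `dersimonian_laird` are assumptions made for illustration.

```python
from math import sqrt


def dersimonian_laird(effects, variances):
    """Standard DerSimonian-Laird random-effects pooling.

    effects   -- per-study effect estimates (hypothetical inputs)
    variances -- per-study within-study variances (must be > 0)
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q measures how far studies scatter around the fixed mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1

    # Method-of-moments estimate of between-study variance tau^2.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))

    return {"pooled": pooled, "se": se, "Q": q, "tau2": tau2, "I2": i2}


if __name__ == "__main__":
    # Two strongly disagreeing sources yield high heterogeneity.
    print(dersimonian_laird([0.0, 4.0], [1.0, 1.0]))
```

A large `I2` or `tau2` signals that the sources disagree more than sampling error alone would explain, which is the kind of inconsistency a heterogeneity check is meant to surface.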

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems: they generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval, inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach checks for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an accuracy improvement of up to 23.31% across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes applications of LLMs more reliable and reduces diagnostic errors.
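The TL;DR states that the additional literature is retrieved from PubMed with BM25. As a rough illustration of that ranking step, here is a minimal, self-contained Okapi BM25 scorer over an in-memory list of texts. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the function name `bm25_rank` are all assumptions; the paper's actual retrieval setup is not specified in this summary.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs against query with Okapi BM25.

    Returns (ranking, scores): document indices sorted best-first,
    and the raw score of each document in input order.
    """
    if not docs:
        raise ValueError("docs must be non-empty")

    # Naive lowercase whitespace tokenization (an assumption).
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of docs containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for common terms.
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Term-frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


if __name__ == "__main__":
    abstracts = [
        "aspirin reduces risk of stroke",
        "metformin lowers blood glucose",
        "stroke rehabilitation exercise",
    ]
    print(bm25_rank("aspirin stroke", abstracts))
```

In a pipeline like the one described, the top-ranked abstracts for each extracted claim would then be passed on to the stance and reliability checks.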

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The task of the M-Eval checking system. M-Eval is designed to detect factual errors in the responses of medical RAG systems and to analyze the quality of their evidence. The output of the task should include a label for each given response and an evaluation of the evidence.
  • Figure 2: The pipeline of the M-Eval checking system. The core of the system is the heterogeneity analysis, detailed at the bottom. For both the extra evidence and the given evidence, we calculate a reliability score and the stance toward the claim. The reliability score is based on the revision date, publication type, and MeSH headings. We then test each article's stance on the claim and determine the final label of the claim.
  • Figure 3: An example of claim extraction. Our extraction is separated into two parts. The main claim combines the question with the choice given in the response. The other claims are those most related to the question.
  • Figure 4: The method used to calculate the reliability of each medical article. For every piece of evidence, we need the publication date, publication type, and MeSH headings to analyze whether the article is reliable.
  • Figure 5: The detailed pipeline of the heterogeneity analysis. We use different LLMs to judge whether the knowledge in each evidence article supports the claim. We then aggregate the claim labels to obtain the final label of the response.
  • ...and 2 more figures