When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
Saeedeh Javadi, Sara Mirabi, Manan Gangar, Bahadorreza Ofoghi
TL;DR
This work tackles the risk of outdated and contradictory evidence in retrieval-augmented generation (RAG) for healthcare by building a TGA–PubMed benchmark, drawn from Australian Therapeutic Goods Administration (TGA) consumer medicine information and PubMed abstracts, and a RAG pipeline that emphasizes temporal diversity and contradiction awareness. The methodology combines a three-tier query strategy, temporal-citation balanced evidence selection, and a diversity-aware maximal marginal relevance (MMR) framework with a contradiction-detection module, and evaluates five LLMs. The study finds that contradictions among retrieved abstracts degrade factual accuracy and that the higher prevalence of contradictions in recent literature intensifies this challenge, underscoring the need for contradiction-aware filtering and temporal reasoning in medical RAG. Overall, the dataset and framework provide a benchmark and a set of insights to guide safer, more reliable RAG systems in high-stakes healthcare contexts.
Abstract
In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions; ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence; and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
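To make the point about retrieval similarity concrete, the following is a minimal sketch of standard maximal marginal relevance (MMR) selection, the general technique behind the diversity-aware selection named in the summary. It is not the authors' implementation: the function names `cosine` and `mmr_select`, the trade-off parameter `lam`, and the toy two-dimensional vectors are all assumptions made for illustration, and the sketch does not model the contradiction-detection module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query_vec, doc_vecs, k, lam=0.7):
    """Return indices of up to k documents chosen greedily by MMR.

    lam (hypothetical name) trades relevance against redundancy:
    lam=1.0 ranks purely by similarity to the query, while lower
    values penalize documents similar to those already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to any already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy embeddings (assumed): documents 0 and 1 are near-duplicates, document 2 differs.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query = [1.0, 0.0]

by_similarity = mmr_select(query, docs, k=2, lam=1.0)  # pure similarity ranking
with_diversity = mmr_select(query, docs, k=2, lam=0.3)  # diversity-weighted ranking
```

In this toy setup, pure similarity ranking keeps both near-duplicate documents, whereas the diversity-weighted setting swaps the second one for the less similar document. Diversity alone does not resolve contradictions between the abstracts that remain, which is why the work pairs it with contradiction detection.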
