Retrieval-Augmented Generation with Estimation of Source Reliability

Jeongyeon Hwang, Junyoung Park, Hyejin Park, Dongwoo Kim, Sangdon Park, Jungseul Ok

TL;DR

This work addresses factual accuracy in Retrieval-Augmented Generation (RAG) when sources differ in reliability. It introduces Reliability-Aware RAG (RA-RAG), which iteratively estimates the reliability of each of $N$ sources to assign weights $v_i$, uses $\kappa$-Reliable and Relevant Source Selection ($\kappa$-RRSS) to retrieve from a small, trustworthy subset, and aggregates the per-source outputs by weighted majority voting (WMV). Key innovations include automated reliability estimation via cross-source queries, with no manual fact-checking, and a filtering plus semantic clustering pipeline (AlignScore and $\mathcal{C}(\cdot)$) that grounds and combines per-source outputs efficiently. Empirical results on synthetic benchmarks and real-world sources show that RA-RAG consistently surpasses baselines, closely approaches Oracle WMV as the source pool grows, and produces reliability estimates that correlate strongly with the true values (e.g., Pearson and Spearman correlation coefficients, PCC and SRCC, near 0.99), while $\kappa$-RRSS reduces inference cost. This makes RAG more practical for real-world knowledge bases by delivering more grounded, trustworthy answers with scalable retrieval.

Abstract

Retrieval-Augmented Generation (RAG) is an effective approach to enhance the factual accuracy of large language models (LLMs) by retrieving information from external databases, which are typically composed of diverse sources, to supplement the limited internal knowledge of LLMs. However, standard RAG often risks retrieving incorrect information, as it relies solely on relevance between a query and a document, overlooking the heterogeneous reliability of these sources. To address this issue, we propose Reliability-Aware RAG (RA-RAG), a new multi-source RAG framework that estimates the reliability of sources and leverages this information to prioritize highly reliable and relevant documents, ensuring more robust and accurate response generation. Specifically, RA-RAG first estimates source reliability by cross-checking information across multiple sources. It then retrieves documents from the top-$\kappa$ reliable and relevant sources and aggregates their information using weighted majority voting (WMV), where the selective retrieval ensures scalability without compromising performance. Comprehensive experiments show that RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. Furthermore, we demonstrate the ability of RA-RAG to estimate real-world sources' reliability, highlighting its practical applicability. Our code and data are available at https://github.com/ml-postech/RA-RAG.
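
The two inference steps named in the abstract, top-$\kappa$ source selection and weighted majority voting, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the per-source relevance flags, the example reliability values, and the use of raw reliability scores as vote weights are all assumptions made for illustration. The abstract does not specify how reliability maps to a weight.

```python
from collections import defaultdict


def select_top_kappa(reliability, relevant, kappa):
    """Keep the kappa most reliable sources among those flagged as relevant.

    `relevant` is a hypothetical per-source flag; how relevance is decided
    is not specified here.
    """
    candidates = [s for s in reliability if relevant.get(s, False)]
    return sorted(candidates, key=lambda s: reliability[s], reverse=True)[:kappa]


def weighted_majority_vote(answers, weights):
    """Return the answer whose supporting sources have the largest total weight."""
    totals = defaultdict(float)
    for source, answer in answers.items():
        totals[answer] += weights[source]
    return max(totals, key=totals.get)


# Illustrative values only (assumed, not taken from the paper).
reliability = {"A": 0.9, "B": 0.8, "C": 0.3, "D": 0.2}
relevant = {"A": True, "B": False, "C": True, "D": True}
answers = {"A": "SARS-CoV-2", "B": "SARS-CoV-2", "C": "5G networks", "D": "5G networks"}

# Selective retrieval: only the two most reliable relevant sources are queried.
chosen = select_top_kappa(reliability, relevant, kappa=2)
final = weighted_majority_vote({s: answers[s] for s in chosen}, reliability)

# Contrast on all relevant sources: uniform weights vs. reliability weights.
pool = {s: answers[s] for s in ("A", "C", "D")}
unweighted = weighted_majority_vote(pool, {s: 1.0 for s in pool})
weighted = weighted_majority_vote(pool, reliability)
```

In this toy setting, unweighted voting over sources A, C, and D follows the two unreliable sources, while reliability weighting lets the single reliable source prevail. Restricting to $\kappa = 2$ sources reaches the same answer with fewer per-source queries.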

Paper Structure

This paper contains 35 sections, 6 equations, 14 figures, 18 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison between the standard RAG and RA-RAG. The standard RAG retrieves documents without distinguishing sources, leading to the risk of incorporating incorrect information from unreliable sources (e.g., falsely associating COVID-19 with 5G networks). In contrast, RA-RAG estimates the reliability of each source (denoted by the numbers inside circles) and selectively retrieves documents from highly reliable and relevant sources, as detailed in Section \ref{method:inference}. The information from multiple sources is then aggregated using Weighted Majority Voting (WMV), ensuring a more accurate final answer (e.g., correctly identifying SARS-CoV-2 as the cause of COVID-19).
  • Figure 2: Accuracy on the NQ and TQA datasets. (a) Results with heterogeneous reliability via beta priors for varying numbers of sources (4 to 9) across the Llama3-8B-Instruct and GPT-4o-mini models. See Appendix \ref{appendix:exp:beta_prior} for results on the HotpotQA dataset and the Phi3-mini-Instruct model. (b) Results in the adversarial setting via the adversary-hammer prior for varying numbers of adversaries (1 to 7) with the Llama3-8B-Instruct model, highlighting overall trends. Exact values, which may overlap significantly, are provided in Appendix \ref{tab:em_adversary_appendix} along with HotpotQA results.
  • Figure 3: A qualitative example comparing the answers produced by MV and RA-RAG for a query from the NQ dataset. Additional examples are available in Appendix \ref{appendix:qualitaive:mv_ours}.
  • Figure 4: Accuracy across different values of $\kappa$ with Llama3-8B-Instruct on the NQ dataset. Results for other datasets are provided in Appendix \ref{appendix:exp:rrs}.
  • Figure 5: Results of reliability estimation under augmented variation for User A. Additional results for Politician A and B are in Appendix \ref{appendix:real}.
  • ...and 9 more figures