Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs

Mohammad Reza Rezaei, Adji Bousso Dieng

TL;DR

Vendi-RAG introduces an adaptive, diversity-aware retrieval framework for multi-hop QA by integrating the Vendi Score with a dynamic diversity-relevance trade-off and an LLM-based answer judge. Through iterative refinement, the framework broadens semantic coverage while maintaining answer quality, adjusting the diversity-relevance parameter $s$ according to the judge's quality signal $Q_t$ until that signal reaches a threshold $Thr$. Empirical results on HotpotQA, MuSiQue, and 2WikiMultiHopQA show consistent accuracy gains over Adaptive-RAG, especially as the number of retrieved documents grows, and demonstrate robustness across GPT-3.5, GPT-4, and GPT-4o-mini. The approach advances retrieval robustness and reasoning performance by balancing global diversity with query relevance, offering a model-agnostic solution for complex multi-hop QA tasks.

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) for domain-specific question-answering (QA) tasks by leveraging external knowledge sources. However, traditional RAG systems primarily focus on relevance-based retrieval and often struggle with redundancy, especially when reasoning requires connecting information from multiple sources. This paper introduces Vendi-RAG, a framework based on an iterative process that jointly optimizes retrieval diversity and answer quality. This joint optimization leads to significantly higher accuracy for multi-hop QA tasks. Vendi-RAG leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to promote semantic diversity in document retrieval. It then uses an LLM judge that evaluates candidate answers, generated after a reasoning step, and outputs a score that the retriever uses to balance relevance and diversity among the retrieved documents during each iteration. Experiments on three challenging datasets -- HotpotQA, MuSiQue, and 2WikiMultiHopQA -- demonstrate Vendi-RAG's effectiveness in multi-hop reasoning tasks. The framework achieves significant accuracy improvements over traditional single-step and multi-step RAG approaches, with accuracy increases reaching up to +4.2% on HotpotQA, +4.1% on 2WikiMultiHopQA, and +1.3% on MuSiQue compared to Adaptive-RAG, the current best baseline. The benefits of Vendi-RAG are even more pronounced as the number of retrieved documents increases. Finally, we evaluated Vendi-RAG across different LLM backbones, including GPT-3.5, GPT-4, and GPT-4o-mini, and observed consistent improvements, demonstrating that the framework's advantages are model-agnostic.
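
As an illustration of the diversity metric mentioned above, the Vendi Score of $n$ items with a positive semidefinite similarity matrix $K$ (with $K_{ii} = 1$) is the exponential of the Shannon entropy of the eigenvalues of $K/n$. The sketch below is a minimal version under stated assumptions: the function name `vendi_score` is hypothetical, documents are represented by nonzero embedding vectors, and cosine similarity is used as the kernel. The abstract does not specify the embedding model or kernel used by Vendi-RAG, so this should not be read as the authors' implementation.

```python
import numpy as np


def vendi_score(embeddings):
    """Vendi Score of a set of items under a cosine-similarity kernel.

    Assumption: each row of `embeddings` is a nonzero vector for one document.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalize rows so that K = X X^T is a cosine-similarity matrix with unit diagonal.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T
    n = K.shape[0]
    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically zero eigenvalues (0 * log 0 is taken to be 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Three identical documents behave like one effective item (score close to 1).
redundant = vendi_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
# Three mutually orthogonal documents count as three effective items (score close to 3).
diverse = vendi_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

The score can be read as an effective number of distinct documents, which is why a retriever that maximizes it is pushed away from returning near-duplicates.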

Paper Structure

This paper contains 35 sections, 4 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The process begins with an initial retrieval step, where a diverse set of documents is retrieved using the Vendi Score, ensuring broad semantic coverage. Next, leveraging a reasoning step to construct a coherent path to the final answer, the LLM generates an answer, which then undergoes quality assessment by an LLM judge. Based on the answer quality, the retriever is adjusted to balance diversity and relevance: high-quality answers limit the emphasis on diversity, while low-quality answers prompt the retriever to prioritize diversity more heavily. This adjustment is controlled by an adaptive parameter, $s$, which is updated over iterations. The process continues until the answer quality reaches an optimal threshold, denoted by $Thr$. Finally, the highest-quality responses and documents are selected, ensuring both diversity and accuracy.
  • Figure 2: Performance comparison of Vendi-RAG and Adaptive-RAG across different document sizes in terms of Exact Match, F1-score, Accuracy, and Vendi Score on HotpotQA. The backbone LLM used is GPT-4o-mini. Vendi-RAG consistently outperforms Adaptive-RAG across all metrics. In particular, performance improves as the number of retrieved documents increases. Different variants of Vendi-RAG are plotted based on the fixed initialization value $s_1$ for the diversity-relevance parameter $s_t$, with $s_1 = 0.8$ achieving the best overall results.
  • Figure 3: Performance comparison of Vendi-RAG and Adaptive-RAG variants across the three datasets using three evaluation metrics: F1-score, Exact Match, and Accuracy. Results show that Vendi-RAG-4o consistently outperforms the other variants across all datasets and metrics, with particularly strong performance on HotpotQA.
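
The iterative procedure described in the Figure 1 caption can be sketched as a simple control loop. This is a hedged illustration only: the function `adaptive_retrieval_loop`, the callables `retrieve`, `generate`, and `judge`, the threshold value `thr=0.9`, the iteration cap, and in particular the update rule `s = 1 - quality` are all assumptions made for the example, since the captions do not give the paper's actual update rule. Here `s` is assumed to lie in [0, 1], with larger values placing more weight on diversity. The default `s_init=0.8` mirrors the best-performing $s_1$ reported in Figure 2.

```python
from typing import Callable, List, Tuple


def adaptive_retrieval_loop(
    query: str,
    retrieve: Callable[[str, float], List[str]],     # hypothetical: (query, s) -> documents
    generate: Callable[[str, List[str]], str],       # hypothetical: (query, docs) -> answer
    judge: Callable[[str, str, List[str]], float],   # hypothetical: quality score in [0, 1]
    s_init: float = 0.8,
    thr: float = 0.9,       # assumed threshold value, not taken from the paper
    max_iters: int = 5,     # assumed safety cap on iterations
) -> Tuple[str, List[str], float]:
    """Return the best (answer, documents, quality) found across iterations."""
    s = s_init
    best: Tuple[str, List[str], float] = ("", [], float("-inf"))
    for _ in range(max_iters):
        docs = retrieve(query, s)
        answer = generate(query, docs)
        quality = judge(query, answer, docs)
        # Keep the highest-quality answer and its documents seen so far.
        if quality > best[2]:
            best = (answer, docs, quality)
        # Stop once the judged quality reaches the threshold.
        if quality >= thr:
            break
        # Assumed update rule: lower quality -> more emphasis on diversity.
        s = min(1.0, max(0.0, 1.0 - quality))
    return best
```

Passing the retriever, generator, and judge as callables keeps the sketch independent of any particular LLM backbone or vector store.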