Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering

Yeonjun In, Sungchul Kim, Ryan A. Rossi, Md Mehrab Tanjim, Tong Yu, Ritwik Sinha, Chanyoung Park

TL;DR

Diva tackles the practical inefficiencies and low-quality retrievals in retrieval-augmented QA for ambiguous questions by introducing Retrieval Diversification (RD) to proactively cover diverse interpretations and Adaptive Generation (AG) to verify passage quality and choose the most suitable generation strategy. RD infers pseudo-interpretations via dual LLM prompts and retrieves passages that maximize coverage of these interpretations, while AG uses a novel retrieval verification (RV) scheme that grades retrieval quality into Useful, PartialUseful, and Useless, guiding whether to answer via RAG or rely on the LLM's internal knowledge. Empirical results on ASQA and SituatedQA show Diva outperforms Iterative RAG and CRAG in both accuracy and efficiency (approximately $1.5$–$3\times$ faster), with RD consistently improving retrieval quality and AG enhancing robustness under low-quality retrieval. The framework demonstrates strong generalization across backbones (from Llama3 to GPT-4) and remains effective on unambiguous questions, indicating practical impact for real-world ambiguous QA systems. Limitations include sensitivity of the RV component to the LLM choice and the need for enhanced ambiguity classifiers and verifiers in future work.
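The adaptive generation step described above can be sketched as a small routing function. This is a minimal illustration, not the authors' implementation: the three labels come from the summary, but the names `RetrievalQuality`, `adaptive_generate`, `verify`, `rag_answer`, and `closed_book_answer` are hypothetical, and the mapping from labels to strategies (only Useless falls back to the LLM's internal knowledge) is an assumption.

```python
from enum import Enum
from typing import Callable, List


class RetrievalQuality(Enum):
    """The three retrieval-quality grades named in the summary."""
    USEFUL = "Useful"
    PARTIAL_USEFUL = "PartialUseful"
    USELESS = "Useless"


def adaptive_generate(
    question: str,
    passages: List[str],
    verify: Callable[[str, List[str]], RetrievalQuality],
    rag_answer: Callable[[str, List[str]], str],
    closed_book_answer: Callable[[str], str],
) -> str:
    """Grade the retrieved passages, then pick a generation strategy.

    All three callables are hypothetical stand-ins for LLM calls.
    """
    quality = verify(question, passages)
    # Assumption: only a Useless grade discards the passages entirely.
    if quality is RetrievalQuality.USELESS:
        return closed_book_answer(question)
    # Useful and PartialUseful both answer from the retrieved passages.
    return rag_answer(question, passages)
```

Passing the verifier and the two generators in as callables keeps the routing logic independent of any particular LLM backbone, which matters here because the summary notes that the verifier is sensitive to the LLM choice.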

Abstract

The retrieval-augmented generation (RAG) framework addresses ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on those passages. However, our preliminary studies reveal that a single retrieval process often yields low-quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the generation approach to their quality. This approach improves the QA system's accuracy and robustness by handling the low-quality retrieval issue in ambiguous questions, while enhancing efficiency.
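One way to picture the diversification step is as a coverage problem: given a set of inferred interpretations, pick a small set of passages that together support as many of them as possible. The sketch below uses a greedy heuristic for this; it is an assumption made for illustration, not the selection procedure from the paper, and the names `select_diverse_passages` and `supports` are hypothetical.

```python
from typing import Callable, List, Set


def select_diverse_passages(
    passages: List[str],
    interpretations: List[str],
    supports: Callable[[str, str], bool],
    k: int,
) -> List[str]:
    """Greedily pick up to k passages that cover the most interpretations.

    `supports(passage, interpretation)` is a hypothetical relevance check,
    for example a retriever score above some threshold.
    """
    # Indices of the interpretations that each passage supports.
    support_sets: List[Set[int]] = [
        {j for j, interp in enumerate(interpretations) if supports(p, interp)}
        for p in passages
    ]
    covered: Set[int] = set()
    selected: List[str] = []
    remaining = list(range(len(passages)))
    while remaining and len(selected) < k:
        # Choose the passage that adds the most not-yet-covered interpretations.
        best = max(remaining, key=lambda i: len(support_sets[i] - covered))
        if not support_sets[best] - covered:
            break  # no remaining passage adds new coverage
        selected.append(passages[best])
        covered |= support_sets[best]
        remaining.remove(best)
    return selected
```

The loop stops early once no passage adds a new interpretation, so the selection can be shorter than `k`. A real system might instead fill the remaining slots by relevance score.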

Paper Structure (37 sections, 7 equations, 8 figures, 12 tables, 1 algorithm)

This paper contains 37 sections, 7 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: Trade-off between performance and efficiency with the GPT-4 backbone on ASQA. Notably, Diva achieves better performance than iterative RAG (kim2023tree) while being significantly more efficient (2x faster and 1.8x cheaper). The size of each circle indicates the cost per query ($). Closed-book LLM denotes the traditional few-shot prompting method used in brown2020language.
  • Figure 2: A conceptual comparison of RAG approaches to ambiguous QA. (a) Vanilla RAG retrieves passages and generates answers in a single pass, but it may not collect enough information for diverse interpretations (i.e., low-quality retrieval), compromising factual accuracy. (b) Iterative RAG retrieves passages and generates answers in a loop, using previous interpretations to enhance each subsequent iteration's retrieval for exploring missing interpretations. Although effective, it is inefficient due to the repeated use of LLMs and retrievers. (c) Diva retrieves passages covering diverse interpretations without relying on the iterative process and selects the most suitable knowledge for response generation by verifying retrieval quality.
  • Figure 3: Preliminary results on ASQA. (a) Proportion of each quality label among the retrieved passages. (b) Performance comparison by quality label.
  • Figure 4: Conceptual example of prompts for pseudo-interpretations inference.
  • Figure 5: Comparison of the number of tokens per query with the GPT-4 backbone. RD, RV, and AG denote the proposed retrieval diversification, retrieval verification, and adaptive generation modules, respectively.
  • ...and 3 more figures