RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering

Zihan Zhang, Meng Fang, Ling Chen

TL;DR

The paper introduces RetrievalQA to benchmark adaptive retrieval decision-making in short-form open-domain QA, demonstrating that calibration-based approaches require careful threshold tuning while vanilla prompting is insufficient. It proposes Time-Aware Adaptive Retrieval (TA-ARE), a calibration-free prompting method that leverages time cues and demonstrations to improve retrieval decisions via in-context learning. Empirical results show TA-ARE provides notable gains in retrieval and QA accuracy across model families, highlighting practical benefits for efficient, high-quality QA. The work establishes RetrievalQA as a rigorous ARAG testbed and outlines limitations and future directions in prompt design and retrieval robustness.

Abstract

Adaptive retrieval-augmented generation (ARAG) aims to dynamically determine the necessity of retrieval for queries instead of retrieving indiscriminately to enhance the efficiency and relevance of the sourced information. However, previous works largely overlook the evaluation of ARAG approaches, leading to their effectiveness being understudied. This work presents a benchmark, RetrievalQA, comprising 1,271 short-form questions covering new world and long-tail knowledge. The knowledge necessary to answer the questions is absent from LLMs; therefore, external information must be retrieved to answer correctly. This makes RetrievalQA a suitable testbed to evaluate existing ARAG methods. We observe that calibration-based methods heavily rely on threshold tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable retrieval decisions. Based on our findings, we propose Time-Aware Adaptive Retrieval (TA-ARE), a simple yet effective method that helps LLMs assess the necessity of retrieval without calibration or additional training. The dataset and code will be available at https://github.com/hyintell/RetrievalQA
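
The two families of retrieval decisions contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, the `[Yes]`/`[No]` answer format, and the example date are all assumptions made for this sketch; the actual templates live in the paper and its repository.

```python
from datetime import date


def should_retrieve_calibrated(retrieve_prob: float, threshold: float = 0.5) -> bool:
    """Calibration-based decision (assumed form): retrieve when the model's
    estimated probability of needing retrieval exceeds a tuned threshold."""
    return retrieve_prob > threshold


def build_time_aware_prompt(question: str,
                            demonstrations: list[tuple[str, str]],
                            today: date) -> str:
    """Hypothetical time-aware prompt: state the current date, show a few
    labeled demonstrations, then ask for a retrieval decision."""
    parts = [
        f"Today is {today.isoformat()}.",
        "Decide whether external retrieval is needed to answer the question. "
        "Reply [Yes] or [No].",
    ]
    for demo_question, label in demonstrations:
        parts.append(f"Question: {demo_question}\nRetrieval needed: {label}")
    parts.append(f"Question: {question}\nRetrieval needed:")
    return "\n\n".join(parts)


def parse_decision(model_output: str) -> bool:
    """Map the model's free-text reply to a boolean retrieval decision."""
    return model_output.strip().lower().lstrip("[").startswith("yes")


if __name__ == "__main__":
    demos = [("Who wrote Hamlet?", "[No]"),
             ("Who won the most recent World Cup?", "[Yes]")]
    print(build_time_aware_prompt("What is the newest iPhone model?",
                                  demos, date(2024, 1, 15)))
```

The threshold variant needs a tuned value per model, which is the sensitivity the abstract points out; the prompt variant needs no threshold, but depends on the wording and on the demonstrations chosen.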

Paper Structure (31 sections, 2 equations, 8 figures, 11 tables)

This paper contains 31 sections, 2 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Above: QA accuracy on our RetrievalQA with retrieval, without retrieval, and with adaptive retrieval. We set threshold $t=0.5$ for calibration-based Self-RAG (Asai et al., 2023) and use model-based vanilla prompting for the others (\ref{sec_preliminary}). We find that Self-RAG requires threshold tuning to balance QA performance and retrieval efficiency, while vanilla prompting is insufficient in guiding LLMs to make reliable retrieval decisions (\ref{sec_initial_results}). Below: an error analysis for GPT-3.5. At least half of the time, GPT-3.5 is unaware that it needs retrieval (i.e., the red area, \ref{sec_error_analysis}).
  • Figure 2: Retrieval accuracy on long-tail vs. new world knowledge (dotted vs. slashed) using Vanilla prompting and our TA-ARE (yellow vs. blue).
  • Figure 3: Error analysis of our TA-ARE for GPT-3.5. Compared to Fig. \ref{fig_sankey_gpt35}, the red and blue areas shrink significantly, indicating that GPT-3.5 has improved awareness of when it needs retrieval.
  • Figure 4: Effect of different numbers of demonstrations. Averaged for all models.
  • Figure 5: Vanilla prompt template for adaptive retrieval (\ref{sec_adaptive_rag_method}).
  • ...and 3 more figures
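
As a rough illustration of the per-category retrieval accuracy that Figure 2 compares, the sketch below assumes that every benchmark question requires retrieval (as the abstract states), so a decision counts as correct exactly when the model chooses to retrieve. The category labels and the function name are hypothetical; the paper's exact metric definition is not reproduced here.

```python
from collections import defaultdict


def retrieval_accuracy(decisions):
    """Compute the fraction of 'retrieve' decisions per knowledge category.

    decisions: iterable of (category, retrieved) pairs, where `retrieved`
    is True if the model chose to retrieve for that question.
    Assumes every question needs retrieval, so True is always correct.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, retrieved in decisions:
        totals[category] += 1
        hits[category] += int(retrieved)
    return {category: hits[category] / totals[category] for category in totals}


if __name__ == "__main__":
    example = [("new_world", True), ("new_world", False),
               ("long_tail", True), ("long_tail", True)]
    print(retrieval_accuracy(example))
```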