RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering
Zihan Zhang, Meng Fang, Ling Chen
TL;DR
The paper introduces RetrievalQA to benchmark adaptive retrieval decision-making in short-form open-domain QA, demonstrating that calibration-based approaches require careful threshold tuning while vanilla prompting is insufficient. It proposes Time-Aware Adaptive Retrieval (TA-ARE), a calibration-free prompting method that leverages time cues and demonstrations to improve retrieval decisions via in-context learning. Empirical results show TA-ARE provides notable gains in retrieval and QA accuracy across model families, highlighting practical benefits for efficient, high-quality QA. The work establishes RetrievalQA as a rigorous ARAG testbed and outlines limitations and future directions in prompt design and retrieval robustness.
Abstract
Adaptive retrieval-augmented generation (ARAG) aims to dynamically determine whether a query needs retrieval rather than retrieving indiscriminately, thereby improving the efficiency and relevance of the sourced information. However, previous works largely overlook the evaluation of ARAG approaches, so their effectiveness remains understudied. This work presents a benchmark, RetrievalQA, comprising 1,271 short-form questions covering new world and long-tail knowledge. The knowledge necessary to answer the questions is absent from LLMs; therefore, external information must be retrieved to answer correctly. This makes RetrievalQA a suitable testbed to evaluate existing ARAG methods. We observe that calibration-based methods rely heavily on threshold tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable retrieval decisions. Based on our findings, we propose Time-Aware Adaptive Retrieval (TA-ARE), a simple yet effective method that helps LLMs assess the necessity of retrieval without calibration or additional training. The dataset and code will be available at https://github.com/hyintell/RetrievalQA
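
To make the contrast between the two families of retrieval decisions concrete, the following is a minimal Python sketch, not the released implementation. The function names, the prompt wording, the `[Yes]`/`[No]` answer format, and the default threshold of 0.5 are all illustrative assumptions; the source only states that calibration-based methods depend on a tuned threshold and that TA-ARE prompts the model with time cues and demonstrations.

```python
from datetime import date
from typing import List, Tuple


def needs_retrieval_by_threshold(confidence: float, threshold: float = 0.5) -> bool:
    """Calibration-style decision (hypothetical helper).

    Retrieve when the model's confidence score falls below a threshold.
    The threshold value is an assumption and would need tuning per model.
    """
    return confidence < threshold


def build_time_aware_prompt(
    question: str,
    today: date,
    demonstrations: List[Tuple[str, bool]],
) -> str:
    """Prompt-style decision (hypothetical wording, not the paper's prompt).

    Prepends the current date as a time cue and lists a few demonstrations
    of (question, retrieval needed?) before asking about the new question.
    """
    lines = [
        f"Today is {today.isoformat()}.",
        "Decide whether external retrieval is needed to answer the question. "
        "Answer [Yes] or [No].",
    ]
    for demo_question, demo_needs_retrieval in demonstrations:
        lines.append(f"Question: {demo_question}")
        lines.append(f"Retrieval needed: {'[Yes]' if demo_needs_retrieval else '[No]'}")
    lines.append(f"Question: {question}")
    lines.append("Retrieval needed:")
    return "\n".join(lines)


def parse_decision(model_output: str) -> bool:
    """Map the model's text output to a boolean retrieval decision."""
    return "[yes]" in model_output.lower()
```

In this sketch the prompt-based path needs no confidence score and no tuned threshold: the LLM's completion of the final `Retrieval needed:` line is passed to `parse_decision`, and retrieval is triggered only when it returns `True`.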
