Unanswerability Evaluation for Retrieval Augmented Generation
Xiangyu Peng, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
TL;DR
The paper addresses a gap in the evaluation of retrieval-augmented generation (RAG) by introducing UAEval4RAG, a knowledge-base-aligned framework that assesses how well RAG systems reject unanswerable queries. It defines a six-category taxonomy of unanswerable requests, provides an automated data-generation pipeline, and proposes three LLM-based metrics (Unanswered Ratio, Acceptable Ratio, and Joint Score) to balance rejection against answer accuracy. Experiments across multiple RAG components, LLM backbones, and prompting strategies reveal trade-offs between the two, and show that LLM selection and prompt design play pivotal roles in achieving robust, responsible RAG behavior. The framework enables targeted optimization of RAG configurations for a specific knowledge base, with the aim of safer and more reliable systems in real-world applications.
Abstract
Existing evaluation frameworks for retrieval-augmented generation (RAG) systems focus on answerable queries, but they overlook the importance of appropriately rejecting unanswerable requests. In this paper, we introduce UAEval4RAG, a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries for any given knowledge base and scores system responses with unanswered ratio and acceptable ratio metrics. We conduct experiments with various RAG components, including retrieval models, rewriting methods, rerankers, language models, and prompting strategies, and reveal hidden trade-offs in the performance of RAG systems. Our findings highlight the critical role of component selection and prompt design in optimizing RAG systems to balance accuracy on answerable queries with high rejection rates on unanswerable ones. UAEval4RAG provides valuable insights and tools for developing more robust and reliable RAG systems.
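As a rough illustration of how ratio-style metrics of this kind could be computed, the sketch below assumes each response to an unanswerable query has already been labelled by a judge (for example, an LLM). The `Judgment` record, the function names, and especially the weighted combination in `joint_score` are hypothetical; the text above does not give the exact definitions, so this should be read as one plausible interpretation rather than the framework's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Judgment:
    """Hypothetical judge verdict for one response to an unanswerable query."""
    rejected: bool    # the system declined to answer
    acceptable: bool  # the response was judged appropriate for the query


def _ratio(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def unanswered_ratio(judgments: List[Judgment]) -> float:
    """Share of unanswerable queries that the system did not answer."""
    return _ratio(j.rejected for j in judgments)


def acceptable_ratio(judgments: List[Judgment]) -> float:
    """Share of responses the judge considered acceptable."""
    return _ratio(j.acceptable for j in judgments)


def joint_score(acceptable: float, answer_accuracy: float, weight: float = 0.5) -> float:
    """Illustrative weighted average of rejection quality and answer accuracy.

    Assumption: the weighting scheme and the default weight are placeholders,
    not values taken from the source.
    """
    return weight * acceptable + (1.0 - weight) * answer_accuracy


judgments = [
    Judgment(rejected=True, acceptable=True),
    Judgment(rejected=True, acceptable=False),
    Judgment(rejected=False, acceptable=False),
    Judgment(rejected=False, acceptable=True),
]
print(unanswered_ratio(judgments), acceptable_ratio(judgments))
```

Keeping the two ratios separate from any combined score makes the trade-off visible: a configuration that refuses everything would score well on rejection while doing poorly on answerable queries.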
