IRB: Automated Generation of Robust Factuality Benchmarks
Lam Thanh Do, Bhagyashree Taleka, Hozaifa Ammar Bhutta, Vikram Sharma Mailthody, Kevin Chen-Chuan Chang, Wen-mei Hwu
TL;DR
This paper addresses the saturation and maintenance challenges of static factuality benchmarks for retrieval-augmented generation (RAG). It introduces IRB, a fully automated, scaffolded benchmark generation framework that grounds QA in human-verified Wikipedia citations (the factual scaffold) and constrains QA via a knowledge-graph-based algorithm (the algorithmic scaffold). Results on the IRB1K benchmark show that frontier LLMs struggle in closed-book settings, while retrieval and reasoning-focused models improve performance, with retriever quality emerging as the primary driver of system correctness. The work also provides an end-to-end pipeline, implementation details, quality analyses, and open-source code and data to enable replication and further research in robust RAG evaluation. Overall, IRB offers a scalable approach to generating diverse, up-to-date, and controllable factuality benchmarks that push RAG systems toward more reliable grounding.
Abstract
Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline built on a factual scaffold and an algorithmic scaffold. We use IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.
