IRB: Automated Generation of Robust Factuality Benchmarks

Lam Thanh Do, Bhagyashree Taleka, Hozaifa Ammar Bhutta, Vikram Sharma Mailthody, Kevin Chen-Chuan Chang, Wen-mei Hwu

TL;DR

This paper addresses the saturation and maintenance challenges of static factuality benchmarks for retrieval-augmented generation (RAG). It introduces IRB, a fully automated, scaffolded benchmark generation framework that grounds QA in human-verified Wikipedia citations (factual scaffold) and constrains QA via a knowledge-graph-based algorithm (algorithmic scaffold). The IRB1K benchmark demonstrates that frontier LLMs struggle in closed-book settings, but retrieval and reasoning-focused models improve performance, with retriever quality emerging as the primary driver of system correctness. The work also provides an end-to-end pipeline, implementation details, quality analyses, and open-source code and data to enable replication and further research in robust RAG evaluation. Overall, IRB offers a scalable approach to generating diverse, up-to-date, and controllable factuality benchmarks that push RAG systems toward more reliable grounding.

Abstract

Static benchmarks for RAG systems often suffer from rapid saturation and require significant manual effort to maintain robustness. To address this, we present IRB, a framework for automatically generating benchmarks to evaluate the factuality of RAG systems. IRB employs a structured generation pipeline utilizing a factual scaffold and an algorithmic scaffold. We utilize IRB to construct a benchmark and evaluate frontier LLMs and retrievers. Our results demonstrate that IRB poses a significant challenge for frontier LLMs in the closed-book setting. Furthermore, our evaluation suggests that reasoning LLMs are more reliable, and that improving the retrieval component may yield more cost-effective gains in RAG system correctness than scaling the generator.

Paper Structure (29 sections, 16 figures, 11 tables)

Figures (16)

  • Figure 1: Fact & supporting documents extraction from a citing sentence. The resulting fact contains two keypoints, each with different supporting documents. Groundedness check is performed on each (keypoint, document) pair.
  • Figure 2: The question generation process operates in three stages. First, a fact is structured into a knowledge graph. Subsequently, this graph is transformed into up to three distinct variants, namely single-hop, multi-hop, and false-premise. Finally, each variant is utilized to generate a corresponding natural language question. In this figure, we only show the question generation process for the single-hop variant. The reference date is 29 Sept. 2025.
  • Figure 3: Effect of reranking.
  • Figure 4: RAG performance (correctness and incorrectness) as the number of retrieved contexts is increased.
  • Figure 5: Comparison of LLM reasoning effort in RAG (with retrieval) versus closed-book (without retrieval) settings across questions with varying attributes.
  • ...and 11 more figures
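
The Figure 1 caption describes a fact that is split into keypoints, each with its own supporting documents, and a groundedness check run on every (keypoint, document) pair. The pairing step can be sketched as below. This is a minimal illustration only: the class names, field names, helper function, and example data are all hypothetical and are not taken from the IRB code, and the groundedness check itself is not implemented here.

```python
from dataclasses import dataclass, field


@dataclass
class Keypoint:
    # One atomic claim extracted from the citing sentence (hypothetical structure).
    text: str
    # Identifiers of the documents cited in support of this keypoint.
    supporting_docs: list[str] = field(default_factory=list)


@dataclass
class Fact:
    # The sentence the fact was extracted from, plus its keypoints.
    citing_sentence: str
    keypoints: list[Keypoint]


def groundedness_pairs(fact: Fact) -> list[tuple[str, str]]:
    """List every (keypoint, document) pair that would need a groundedness check."""
    return [(kp.text, doc) for kp in fact.keypoints for doc in kp.supporting_docs]


# Made-up example data, not taken from the paper.
example = Fact(
    citing_sentence="The bridge opened in 1932 and spans 503 metres.",
    keypoints=[
        Keypoint("The bridge opened in 1932.", ["doc_a"]),
        Keypoint("The bridge spans 503 metres.", ["doc_b", "doc_c"]),
    ],
)
pairs = groundedness_pairs(example)
```

Keeping the documents attached to each keypoint, rather than to the fact as a whole, mirrors the caption's point that different keypoints can have different supporting documents.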