LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, Ran Tavory
TL;DR
LiveRAG provides a public, calibrated benchmark for evaluating RAG-based QA systems, derived from the SIGIR'2025 LiveRAG Challenge and extended with ground-truth answers, answer claims, and IRT-based per-question difficulty and discriminability scores. It aggregates 895 questions generated via DataMorgana from the FineWeb-10BT corpus, paired with supporting documents and a fixed LLM setup to enable controlled cross-session evaluation and curriculum-like training. The analysis formalizes an IRT-based interpretation of question difficulty using the 2PL model, with a difficulty parameter $b_i$ and a discrimination parameter $a_i$ for each question $i$, implemented via the py-irt package, and shows strong alignment with system performance and leaderboard rankings. Empirical results show that difficulty correlates with model performance across LLMs and RAG systems, that multi-document questions pose greater challenges, and that LiveRAG exhibits high linguistic and semantic diversity, which makes it useful for stress-testing long-tail Q&A in retrieval-augmented setups.
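To make the roles of $b_i$ and $a_i$ concrete, here is a minimal sketch of the textbook 2PL response function, $P(\text{correct}) = \sigma(a_i(\theta_j - b_i))$, where $\theta_j$ is the ability of system $j$. The function name `p_correct` and the item values are hypothetical and chosen only for illustration; the exact parameterization and fitting procedure used through py-irt may differ from this sketch.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Textbook 2PL: probability that a system with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items (not taken from the LiveRAG data).
easy_item = {"a": 1.5, "b": -1.0}  # low difficulty
hard_item = {"a": 1.5, "b": 2.0}   # high difficulty

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(
        f"theta={theta:+.1f}  "
        f"easy={p_correct(theta, **easy_item):.3f}  "
        f"hard={p_correct(theta, **hard_item):.3f}"
    )
```

Under this form, a system whose ability equals an item's difficulty has a 50% chance of answering it correctly, and a larger $a_i$ makes the probability change more sharply around $b_i$, which is what lets an item separate stronger systems from weaker ones.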
Abstract
With Retrieval Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to systematically evaluate the effectiveness of RAG systems. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers and their associated supporting claims, which were used to evaluate competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived by applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. We hope the LiveRAG benchmark will help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
