LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, Ran Tavory

TL;DR

LiveRAG provides a public, calibrated benchmark for evaluating RAG-based QA systems, derived from the SIGIR'2025 LiveRAG Challenge and extended with ground-truth answers, answer claims, and IRT-based per-question difficulty and discriminability scores. It comprises 895 questions generated via DataMorgana from the FineWeb-10BT corpus, paired with supporting documents and a fixed LLM setup to enable controlled cross-session evaluation and curriculum-like training. The analysis interprets question difficulty through the two-parameter logistic (2PL) IRT model, with per-question difficulty $b_i$ and discriminability $a_i$, fitted with the py-irt package, and reports strong alignment with system performance and leaderboard rankings. Empirical results show that difficulty correlates with model performance across LLMs and RAG systems, that multi-document questions pose greater challenges, and that LiveRAG exhibits high linguistic and semantic diversity, underscoring its usefulness for stress-testing long-tail Q&A in retrieval-augmented setups.
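
To make the roles of $b_i$ and $a_i$ concrete, here is a minimal sketch of the textbook 2PL response function, $P(\text{correct}) = \sigma(a_i(\theta_j - b_i))$, where $\theta_j$ is the ability of system $j$. The function name and the numeric values are hypothetical illustrations, not taken from the dataset, and the exact parametrization and response encoding used by the authors with py-irt may differ.

```python
import math


def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Textbook 2PL: probability that a system with ability `theta`
    answers correctly a question with discriminability `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values for illustration only (not from the LiveRAG dataset).
# Same system ability and discriminability, two different difficulties.
easy = two_pl_probability(theta=0.0, a=1.5, b=-1.0)
hard = two_pl_probability(theta=0.0, a=1.5, b=1.0)

print(f"easy question: {easy:.3f}")
print(f"hard question: {hard:.3f}")
```

Under this form, a larger $b_i$ lowers the success probability for any fixed ability, while a larger $a_i$ sharpens the transition around $\theta_j = b_i$, which is what makes a question useful for separating systems of similar strength.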

Abstract

With Retrieval Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to evaluate the effectiveness of RAG systems systematically. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims, which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Question parameters learned by the IRT-2PL model for the LiveRAG dataset. Top: Difficulty distribution. Left: Discriminability distribution. Middle: Scatter plot of difficulty and discriminability scores of all benchmark questions.
  • Figure 2: Team performance distributions across diff bins. Teams are ordered from left to right by their leaderboard position. The rightmost distribution represents Falcon3-10B without RAG, given for reference.
  • Figure 3: Performance distributions of GPT-4.1 and several LLaMA models of different sizes (without RAG) across the diff bins.
  • Figure 4: Box-plot presentation of the diff distributions across question categorizations. Median values are shown in bold. Number of questions per category is indicated in parentheses.