
Generative Information Retrieval Evaluation

Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson

TL;DR

This chapter analyzes how to evaluate Generative Information Retrieval (GenIR) systems from two angles: leveraging LLMs for evaluation and assessing end-to-end GenIR outputs, including Retrieval-Augmented Generation (RAG) architectures. It discusses using LLMs to generate relevance labels and create test-collection variants, and it addresses the circularity that arises when models assess their own outputs in two ways: by framing LLM-based assessment as "slow search" and by grounding evaluation in human assessment. It proposes nugget-based and subtopic-based evaluation as scalable, reusable principles for GenIR, and examines the unique challenges of evaluating hallucinations and end-to-end response quality in GenIR interfaces. Overall, the work outlines practical strategies for robust, ground-truth-aligned GenIR evaluation and highlights the need for multi-dimensional relevance, validated user simulations, and adapted metrics such as $\mathrm{NDCG}$, $\mathrm{MAP}$, and $\mathrm{recall@1000}$ to ensure meaningful assessment in practice.

Abstract

In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating that LLMs may be superior to crowdsource workers and other paid assessors on basic relevance judgement tasks. We review past and ongoing related research, including speculation on the future of shared task initiatives, such as TREC, and a discussion on the continuing need for human assessments. Second, we consider the evaluation of emerging LLM-based generative information retrieval (GenIR) systems, including retrieval augmented generation (RAG) systems. We consider approaches that focus both on the end-to-end evaluation of GenIR systems and on the evaluation of a retrieval component as an element in a RAG system. Going forward, we expect the evaluation of GenIR systems to be at least partially based on LLM-based assessment, creating an apparent circularity, with a system seemingly evaluating its own output. We resolve this apparent circularity in two ways: 1) by viewing LLM-based assessment as a form of "slow search", where a slower IR system is used for evaluation and training of a faster production IR system; and 2) by recognizing a continuing need to ground evaluation in human assessment, even if the characteristics of that human assessment must change.

Paper Structure (17 sections, 3 figures)