Evaluating Generative Ad Hoc Information Retrieval

Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast

TL;DR

This work defines generative ad hoc retrieval and situates it as a fourth-generation web search task that synthesizes information across sources into a grounded text SERP. It builds a theory-driven user model with four evaluation objectives (Prompting, Retrieval, Grounding, Presentation) and three core components (utility, reading, accumulation) to quantify text SERP usefulness. The authors propose operationalization practices (offline/online evaluation, statement segmentation, reference-free and reference-based assessments) and relate their framework to SWAN and EXAM, arguing for a grounded, comparable, and scalable evaluation of generative retrieval. They emphasize the need for empirical validation, meta-evaluation, and user studies to establish reliable measures for grounding fidelity, coverage, coherence, and overall usefulness in practical systems.

Abstract

Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the established evaluation methodology for ranking-based ad hoc retrieval is not suited for the reliable and reproducible evaluation of generated responses. To lay a foundation for developing new evaluation methods for generative retrieval systems, we survey the relevant literature from the fields of information retrieval and natural language processing, identify search tasks and system architectures in generative retrieval, develop a new user model, and study its operationalization.

Paper Structure (35 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: A search engine results page (SERP) has traditionally been a list of document references (list SERP, left). Many generative retrieval systems now have "reinvented" SERPs as generated texts with references (text SERP, right).
  • Figure 2: In generative ad hoc retrieval, a retrieval model is combined with a language model. The notation assumes $\rho$ and $\psi$ have texts from $\mathcal{T}$ as input and output, and that they can be complex pieces of software, like Google or ChatGPT.
  • Figure 3: Taxonomy of generative information retrieval and its two main instantiations: generation-augmented retrieval (GAR, yielding list SERPs) and retrieval-augmented generation (RAG, yielding text SERPs; focus of this paper).
  • Figure 4: The information search process (Vakkari, 2016) transforms an information need into a search outcome (top row). The corresponding evaluation objectives allow the derivation of a user model for an evaluation setting. Generative IR systems cover the steps of 'selection', 'interaction', and 'synthesis', for which we formulate the corresponding evaluation objectives 'retrieval', 'grounding', and 'presentation' (bottom row).
  • Figure 5: Taxonomy of utility dimensions in generative ad hoc retrieval; colors indicating the evaluation objectives.
  • ...and 1 more figure