A Comparison of Methods for Evaluating Generative IR

Negar Arabzadeh; Charles L. A. Clarke

TL;DR

This work addresses how to evaluate generative information retrieval (Gen-IR) systems, whose outputs are not constrained to a fixed document collection. It explores five evaluation methods (binary relevance, graded relevance, subtopic relevance, pairwise preferences, and embeddings) and validates them against human judgments on TREC Deep Learning (DL) Track datasets using multiple LLMs and open models. The study finds that subtopic relevance offers a practical balance between autonomous operation and auditability. Pairwise preferences achieve the best discrimination when exemplars are available, but at higher cost. Embeddings perform well but also rely on exemplars and are harder to audit. The results advance a framework for transparent, auditable Gen-IR evaluation and provide guidance for selecting methods in RAG and conversational Gen-IR deployments. Future work extends to more datasets, RAG contexts, and human-in-the-loop validation.

Abstract

Information retrieval systems increasingly incorporate generative components. For example, in a retrieval-augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but LLMs are increasingly replacing human assessment, producing labels of similar or superior quality to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. For that assurance to be possible, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.
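
As a rough illustration of what checking alignment between LLM-generated labels and human assessments can look like for the binary relevance case, the sketch below computes Cohen's kappa, a common chance-corrected agreement statistic. This is an assumption made for illustration: the abstract does not say which agreement measure the authors use, and the function name `cohen_kappa` and the label lists are hypothetical, invented here rather than taken from the paper or from TREC data.

```python
from collections import Counter


def cohen_kappa(human, llm):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = relevant, 0 = not relevant), invented for illustration.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance. Because the per-item labels stay visible, a human assessor can inspect the specific items on which the two label sets disagree, which is the kind of auditing the abstract calls for.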
