Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I.

Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, Michael Bendersky

TL;DR

This paper addresses the unreliability of IR evaluation that relies on relevance labels produced by generative models, introducing two theoretically grounded methods that place reliable confidence intervals (CIs) around ranking metrics. Prediction-Powered Inference (PPI) combines predicted relevance with the prediction errors observed on a small set of ground-truth labels to yield a potentially tighter dataset-level CI, while Conformal Risk Control (CRC) extends to per-document and per-query CIs by perturbing predicted label distributions and calibrating the resulting bounds. The authors demonstrate that both methods achieve accurate coverage with fewer human annotations than empirical bootstrapping and that CRC provides informative per-query CIs that reflect where the model is more or less reliable. These results offer a practical path to reliable benchmarking of IR systems in low-resource settings, where manual labeling is expensive or infeasible.

Abstract

The traditional evaluation of information retrieval (IR) systems is generally very costly as it requires manual relevance annotation from human experts. Recent advancements in generative artificial intelligence -- specifically large language models (LLMs) -- can generate relevance annotations at an enormous scale with relatively small computational costs. Potentially, this could alleviate the costs traditionally associated with IR evaluation and make it applicable to numerous low-resource applications. However, generated relevance annotations are not immune to (systematic) errors, and as a result, directly using them for evaluation produces unreliable results. In this work, we propose two methods based on prediction-powered inference and conformal risk control that utilize computer-generated relevance annotations to place reliable confidence intervals (CIs) around IR evaluation metrics. Our proposed methods require a small number of reliable annotations from which the methods can statistically analyze the errors in the generated annotations. Using this information, we can place CIs around evaluation metrics with strong theoretical guarantees. Unlike existing approaches, our conformal risk control method is specifically designed for ranking metrics and can vary its CIs per query and document. Our experimental results show that our CIs accurately capture both the variance and bias in evaluation based on LLM annotations, better than the typical empirical bootstrapping estimates. We hope our contributions bring reliable evaluation to the many IR applications where this was traditionally infeasible.
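To make the prediction-powered idea concrete, here is a minimal sketch of the classical PPI confidence interval for a mean, assuming a normal approximation. It is not the authors' implementation: the function name `ppi_mean_ci`, its arguments, and the toy numbers are hypothetical, and the paper's formulation for ranking metrics may differ in its details. The sketch averages the generated labels over the large unlabeled set, then subtracts the average error measured on the small human-annotated set.

```python
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(pred_unlabeled, pred_labeled, true_labeled, alpha=0.05):
    """Return (estimate, lower, upper) for a prediction-powered mean.

    pred_unlabeled: generated labels for items without human annotation.
    pred_labeled:   generated labels for the human-annotated items.
    true_labeled:   human labels for those same items.
    All names are illustrative assumptions, not taken from the paper.
    """
    n_unlabeled = len(pred_unlabeled)
    n_labeled = len(true_labeled)

    # Error of the generated labels, measured on the annotated items.
    errors = [p - y for p, y in zip(pred_labeled, true_labeled)]

    # Mean of the predictions, corrected by the measured average error.
    estimate = fmean(pred_unlabeled) - fmean(errors)

    # Variance of the estimate: prediction spread plus error spread.
    std_err = (variance(pred_unlabeled) / n_unlabeled
               + variance(errors) / n_labeled) ** 0.5

    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Toy data: the generated labels overestimate relevance by 0.1 on average.
estimate, lower, upper = ppi_mean_ci(
    pred_unlabeled=[0.5, 0.7, 0.6, 0.8],
    pred_labeled=[0.6, 0.8, 0.5],
    true_labeled=[0.5, 0.6, 0.5],
)
print(f"estimate={estimate:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

In this sketch the interval narrows when the generated labels are accurate, because the variance of the errors is then small, which is the sense in which predictions can tighten the CI.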

Paper Structure (26 sections, 28 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Three different predicted relevance distributions (left) and their corresponding $\hat{\mu}_\text{CRC}(d, \lambda)$ curves (right).
  • Figure 2: Width (top) and coverage (bottom) of the confidence intervals produced by the methods. The dashed line in the bottom plots is the 95% coverage target. Shaded areas indicate 95% prediction intervals over 500 independent runs.
  • Figure 3: Width of the confidence intervals for increasing levels of LLM bias ($\beta$, top row) and oracle-enhanced LLM accuracy ($\tau \rightarrow 1$, bottom row) with $n=112$ on TREC-DL and $n=125$ on Robust04. Shaded areas indicate 95% prediction intervals over 500 independent runs. Coverage plots are omitted since all methods maintain >95% coverage.
  • Figure 4: 95% CIs produced per query by CRC using LLM-predicted relevance annotations ($\tau = 0$) and oracle-enhanced LLM annotations ($\tau > 0$). The queries are sorted by their true DCG performance (according to human annotations), indicated by red and green dots. Green dots are covered by their CI whereas red dots are not. Blue dots indicate the predicted DCG performance (according to LLM-generated annotations). The CIs shrink considerably as annotations become more accurate ($\tau \rightarrow 1$).
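
The two quantities reported in the figures above, coverage and width, can be computed from a set of intervals and the corresponding true metric values. The following is a minimal sketch under that reading; the helper name `coverage_and_width` and the toy numbers are hypothetical and do not come from the paper's experiments.

```python
def coverage_and_width(intervals, true_values):
    """Return (coverage, mean_width) for a list of (lower, upper) intervals.

    coverage:   fraction of true values that fall inside their interval.
    mean_width: average of upper minus lower across all intervals.
    """
    covered = [lo <= t <= hi for (lo, hi), t in zip(intervals, true_values)]
    widths = [hi - lo for lo, hi in intervals]
    return sum(covered) / len(covered), sum(widths) / len(widths)


# Toy per-query intervals and true metric values (illustrative only).
intervals = [(0.0, 1.0), (2.0, 3.0), (1.0, 1.5), (0.5, 0.6)]
true_values = [0.5, 3.5, 1.2, 0.55]

coverage, mean_width = coverage_and_width(intervals, true_values)
print(f"coverage={coverage:.2f}, mean width={mean_width:.2f}")
```

A method is judged well calibrated when its coverage meets the target level (95% in the figures), and more informative when it does so with a smaller width.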