Benchmarking LLM-based Relevance Judgment Methods

Negar Arabzadeh, Charles L. A. Clarke

TL;DR

The paper evaluates LLM-based relevance judgments for IR by systematically comparing five methods (Binary, Graded, Nugget-based Exam, Nugget-based AutoNuggetizer, and Pairwise Preferences) using two LLMs across four datasets (TREC DL 2019–2021 and ANTIQUE). It uses two evaluation axes: alignment with human labels (whether the ordering of relevance categories is preserved) and agreement with system rankings (compatibility, computed via $\text{RBO}$ with $p=0.9$). The study provides a comprehensive, reproducible benchmark with public data and code. It finds that Pairwise Preferences often align best with human judgments, that Binary and UMBRELA (graded) judgments tend to agree more with system rankings, and that nugget-based methods show dataset-dependent performance. The findings offer practical guidance for selecting LLM-based relevance assessment methods and advocate for auditable evaluation workflows to ensure robustness and interpretability in automated IR evaluation.
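To make the compatibility axis concrete, the following is a minimal sketch of rank-biased overlap (RBO) with $p=0.9$, normalized by the score an ideal ranking obtains against itself. It assumes a simple truncated form of RBO evaluated to a fixed depth; the function names, the `depth` parameter, and the example document IDs are illustrative and are not taken from the paper's released code.

```python
def rbo_truncated(run, ideal, p=0.9, depth=10):
    """Truncated rank-biased overlap between two ranked lists of doc IDs.

    Sums the agreement (overlap / d) at each depth d, weighted by p**(d-1).
    This is a simplified, non-extrapolated form (an assumption of this sketch).
    """
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(run[:d]) & set(ideal[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


def compatibility(run, ideal, p=0.9, depth=10):
    """RBO of a run against an ideal ranking, normalized so ideal vs. ideal is 1."""
    best = rbo_truncated(ideal, ideal, p, depth)
    return rbo_truncated(run, ideal, p, depth) / best if best > 0 else 0.0


# Hypothetical ideal ranking (e.g., documents ordered by LLM-assigned relevance)
ideal = ["d1", "d2", "d3", "d4"]
good_run = ["d1", "d3", "d2", "d4"]
bad_run = ["x1", "x2", "x3", "x4"]

print(compatibility(good_run, ideal))  # close to, but below, 1
print(compatibility(bad_run, ideal))   # 0.0: no shared documents
```

A higher $p$ spreads weight deeper into the ranking, while a lower $p$ concentrates it on the top few positions.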

Abstract

Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods: document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks (2019, 2020, and 2021) as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
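The system-ranking comparison mentioned in the abstract can be illustrated with a small sketch of Kendall's tau (the tau-a variant, which does not correct for ties) between two sets of per-system scores. The system names and score values below are invented for illustration, and the choice of tau-a is an assumption; the abstract does not state which variant is used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings.

    Counts system pairs ordered the same way (concordant) and the opposite
    way (discordant) under the two score sets; tied pairs count as neither.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical effectiveness scores for three systems
human_scores = {"bm25": 0.50, "dense": 0.65, "hybrid": 0.70}  # e.g., from human labels
llm_scores = {"bm25": 0.40, "dense": 0.62, "hybrid": 0.58}    # e.g., from LLM labels

# Two of the three system pairs are ordered the same way, one is flipped.
print(kendall_tau(human_scores, llm_scores))
```

A value near 1 means the LLM-based judgments rank systems almost exactly as the human judgments do, even if the absolute scores differ.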

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Human alignment results for different LLM-based Relevance Judgment methods on DL-19, DL-20, DL-21, and ANTIQUE (top to bottom rows). Within each dataset, comparisons are shown across different relevance categories: Best vs. UnAcceptable, Acceptable vs. UnAcceptable, and Best vs. Acceptable (leftmost to rightmost columns). Darker colors reflect greater ease in distinguishing between the two categories of relevance.
  • Figure 2: Compatibility (LLM assessment) vs nDCG@10 (human assessment) for the relevance assessment methods with the highest Kendall correlation (see the system-ranking metrics table) on runs submitted to TREC DL-20 and DL-21. Plots for all assessment methods and datasets are included in the GitHub repo.
  • Figure 3: Comparing average alignment with human preferences from Figure 1 with system ranking agreement from the system-ranking metrics table on different relevance judgment methods across three TREC DL datasets.
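
The category comparisons in Figure 1 (for example, Best vs. UnAcceptable) can be read as asking how often an LLM-based method ranks a document from the higher human category above a document from the lower one. The sketch below shows one plausible way to compute such a score. It is an assumption for illustration rather than the paper's exact definition: the function name, the half-credit for ties, and the example scores are all hypothetical.

```python
def category_alignment(llm_scores, higher_docs, lower_docs):
    """Fraction of (higher, lower) document pairs that the LLM orders correctly.

    llm_scores maps doc ID -> LLM-derived relevance score.
    Ties receive half credit (an assumption of this sketch).
    """
    wins = ties = total = 0
    for h in higher_docs:
        for l in lower_docs:
            total += 1
            if llm_scores[h] > llm_scores[l]:
                wins += 1
            elif llm_scores[h] == llm_scores[l]:
                ties += 1
    return (wins + 0.5 * ties) / total if total else 0.0


# Hypothetical LLM scores for four documents judged for one query
example_scores = {"d1": 3, "d2": 2, "d3": 2, "d4": 0}
best_docs = ["d1", "d2"]        # human-labelled higher category
acceptable_docs = ["d3", "d4"]  # human-labelled lower category

# Three pairs are ordered correctly and one is tied: (3 + 0.5) / 4
print(category_alignment(example_scores, best_docs, acceptable_docs))
```

Under this reading, a score of 0.5 corresponds to chance-level separation between the two categories, and 1.0 to perfect separation.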