JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra

TL;DR

JudgeBlender addresses the high cost of manual relevance judgments by using ensembles of open-source LLMs and prompts to automate relevance labeling. It introduces two ensemble variants, PromptBlender (multiple prompts) and LLMBlender (multiple LLMs), together with several aggregators that fuse the individual judgments. Experiments on the LLMJudge dataset, which is based on TREC-DL 2023, show that JudgeBlender achieves competitive correlation with human judgments and robust system rankings, often surpassing single-model baselines, while reducing the biases associated with any one model. The work demonstrates that very large models are not strictly necessary for reliable relevance assessment, enabling more scalable and cost-efficient IR evaluation. It also outlines avenues for future expansion across prompts, models, datasets, and aggregation strategies.

Abstract

The effective training and evaluation of retrieval systems require a substantial amount of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.

Paper Structure

This paper contains 17 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The PromptBlender evaluation uses an LLM to grade how relevant a passage is to a query. The evaluation is done by prompting a single LLM with multiple prompts, then aggregating the scores with an aggregation function (see the section on aggregation functions).
  • Figure 2: The LLMBlender evaluation uses multiple LLMs to grade how relevant a passage is to a query. The evaluation is done by prompting several LLMs, then aggregating the scores with an aggregation function (see the section on aggregation functions).
  • Figure 3: Scatter plots of the effectiveness (NDCG@10) of TREC Deep Learning track 2023 runs according to the official TREC human judgments versus (a) RelExp, (b) MultiCriteria, (c) PromptBlender - MV(Avg.), and (d) LLMBlender - MV(Avg.).
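
The blending step described in Figures 1 and 2 (collect several judgments for one query-passage pair, then fuse them with an aggregation function) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes integer relevance grades on a 0-3 scale and reads "MV(Avg.)" as majority voting with ties broken by the rounded average; the summary above states neither. The names `average`, `majority_vote`, `blend`, and the stub judges are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

# A "judge" is any callable mapping (query, passage) to an integer grade.
# In PromptBlender the judges would be one LLM with different prompts; in
# LLMBlender they would be different LLMs. Here they are stubs (assumption).
Judge = Callable[[str, str], int]


def average(labels: Sequence[int]) -> int:
    """Round the mean grade to the nearest integer (halves round up)."""
    return int(mean(labels) + 0.5)


def majority_vote(labels: Sequence[int],
                  tie_breaker: Callable[[Sequence[int]], int] = average) -> int:
    """Return the most frequent grade; defer to tie_breaker on a tie.

    Assumed reading of "MV(Avg.)": majority vote, average as tie-breaker.
    """
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    return tie_breaker(labels)


def blend(judges: Sequence[Judge], query: str, passage: str,
          aggregate: Callable[[Sequence[int]], int] = majority_vote) -> int:
    """Collect one grade per judge and fuse them into a single label."""
    labels = [judge(query, passage) for judge in judges]
    return aggregate(labels)


if __name__ == "__main__":
    # Stub judges that return fixed grades, standing in for LLM calls.
    stub_judges = [lambda q, p: 2, lambda q, p: 2, lambda q, p: 3]
    print(blend(stub_judges, "example query", "example passage"))
```

Swapping `aggregate` for `average` gives a plain averaging aggregator instead, so one entry point covers several fusion strategies.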