JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment
Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra
TL;DR
JudgeBlender addresses the high cost of manual relevance judgments by using ensembles of open-source LLMs and prompts to automate relevance labeling. It introduces two ensemble variants, PromptBlender (multiple prompts) and LLMBlender (multiple LLMs), together with a set of aggregators that fuse the individual judgments. Experiments on the LLMJudge dataset, which is based on TREC-DL 2023, show that JudgeBlender achieves competitive correlation with human judgments and robust system rankings, often surpassing single-model baselines, while reducing the biases associated with any one model. The work demonstrates that very large models are not strictly necessary for reliable relevance assessment, enabling more scalable and cost-efficient IR evaluation. It also outlines avenues for future expansion across prompts, models, datasets, and aggregation strategies.
Abstract
The effective training and evaluation of retrieval systems require a substantial number of relevance judgments, which are traditionally collected from human assessors, a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). Using the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. Our results show that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
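
To make the idea of combining judgments concrete, the following is a minimal sketch of how labels from several judges (different LLMs, or one LLM under different prompts) might be fused into a single relevance label. It is an illustration under stated assumptions, not the aggregators used in the paper: the function names (majority_vote, average_vote, blend), the integer label scale, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most frequent label.

    Ties are broken in favour of the higher label; this is an arbitrary
    choice made for the sketch, not a rule taken from the paper.
    """
    counts = Counter(labels)
    top = max(counts.values())
    return max(label for label, n in counts.items() if n == top)


def average_vote(labels):
    """Return the mean label, rounded half up to an integer grade.

    Assumes non-negative integer labels (for example a 0-3 graded scale).
    """
    return int(mean(labels) + 0.5)


def blend(judgments, aggregator=majority_vote):
    """Fuse per-judge labels into one label per (query_id, doc_id) pair.

    judgments: dict mapping (query_id, doc_id) -> list of labels,
    one label per judge (hypothetical input layout).
    """
    return {pair: aggregator(labels) for pair, labels in judgments.items()}


if __name__ == "__main__":
    # Three hypothetical judges labelling two query-document pairs.
    judgments = {
        ("q1", "d1"): [2, 2, 3],
        ("q1", "d2"): [0, 1, 1],
    }
    print(blend(judgments))
    print(blend(judgments, aggregator=average_vote))
```

Swapping the aggregator argument is the only change needed to compare fusion strategies on the same set of per-judge labels, which is why the sketch keeps aggregation separate from label collection.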
