Hybrid Pooling with LLMs via Relevance Context Learning

David Otero, Javier Parapar

TL;DR

This work tackles the high cost of manual relevance judgments for IR test collections by introducing Relevance Context Learning (RCL) and a hybrid pooling strategy. RCL uses an Instructor LLM to produce topic-specific relevance narratives from human judgments, then guides a separate Assessor LLM with these narratives to judge new query-document pairs, enabling efficient, interpretable relevance assessment. The hybrid pooling approach splits the pool into a shallow, human-judged portion and a deeper, LLM-judged portion, improving scalability while maintaining quality. Empirical results on DL-19, DL-20, and TREC-8 show that RCL matches or surpasses standard In-Context Learning, with pronounced gains on long documents and reduced input costs, signaling a shift toward explicit relevance-context modeling for robust, cost-effective IR dataset construction.
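The pool split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_hybrid_pool`, the `depth` parameter for the deeper pool, and the toy run data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def split_hybrid_pool(
    runs: Dict[str, List[str]], k: int, depth: int
) -> Tuple[Set[str], Set[str]]:
    """Split one topic's pool into a human-judged and an LLM-judged part.

    runs  : system name -> ranked list of document ids (hypothetical layout)
    k     : shallow depth whose documents go to human assessors
    depth : full pool depth (assumption: the deeper pool is also depth-based)
    """
    human_pool: Set[str] = set()
    full_pool: Set[str] = set()
    for ranking in runs.values():
        human_pool.update(ranking[:k])      # top-k of every run
        full_pool.update(ranking[:depth])   # top-depth of every run
    # Everything pooled but not judged by humans is left to the LLM.
    return human_pool, full_pool - human_pool


# Toy data, made up for illustration only.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
}
human_pool, llm_pool = split_hybrid_pool(runs, k=2, depth=4)
print(sorted(human_pool), sorted(llm_pool))
```

With these toy runs, the human-judged part holds the documents that appear in the top 2 of any run, and the LLM-judged part holds the rest of the depth-4 pool.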

Abstract

High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or In-Context Learning (ICL) with a small number of labeled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalize to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labeled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyze sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-$k$ pool from participating systems is judged by human assessors, while the remaining documents are labeled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
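The two-stage flow in the abstract (Instructor LLM writes a narrative, Assessor LLM judges with it) might look roughly like the sketch below. Everything here is an assumption made for illustration: the prompt wording, the function names, the integer label format, and the stand-in "LLMs", which are plain Python functions rather than real model calls. The paper's actual prompts are not reproduced.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: an "LLM" is any function from prompt text to reply text.
LLM = Callable[[str], str]


def build_instructor_prompt(query: str, judged: List[Tuple[str, int]]) -> str:
    """Ask for a topic-level relevance narrative from human-judged documents."""
    lines = [f"Query: {query}", "Judged documents:"]
    for text, label in judged:
        lines.append(f"- [label={label}] {text}")
    lines.append("Describe what makes a document relevant to this query.")
    return "\n".join(lines)


def build_assessor_prompt(query: str, narrative: str, document: str) -> str:
    """Ask for a label for one unjudged document, guided by the narrative."""
    return (
        f"Query: {query}\n"
        f"Relevance narrative: {narrative}\n"
        f"Document: {document}\n"
        "Answer with a single integer relevance label."
    )


def rcl_judge(query: str, judged: List[Tuple[str, int]], unjudged: List[str],
              instructor: LLM, assessor: LLM) -> List[int]:
    # Stage 1: one Instructor call per topic produces the narrative.
    narrative = instructor(build_instructor_prompt(query, judged))
    # Stage 2: the narrative, not the raw examples, accompanies each document.
    return [int(assessor(build_assessor_prompt(query, narrative, d)).strip())
            for d in unjudged]


# Stand-in models so the sketch runs offline (not real LLM behaviour).
def fake_instructor(prompt: str) -> str:
    return "Relevant documents describe the job role named in the query."


def fake_assessor(prompt: str) -> str:
    doc_line = [l for l in prompt.splitlines() if l.startswith("Document: ")][0]
    return "1" if "analyst" in doc_line else "0"


labels = rcl_judge(
    "what is an aml surveillance analyst",
    judged=[("An AML analyst reviews alerts for suspicious activity.", 1)],
    unjudged=["An AML surveillance analyst monitors transactions.",
              "Recipe for tomato soup."],
    instructor=fake_instructor,
    assessor=fake_assessor,
)
print(labels)
```

One consequence of this shape is that the judged examples are sent only once per topic, in the Instructor prompt, while each Assessor prompt carries just the narrative and the document under assessment.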

Paper Structure (24 sections, 3 figures, 3 tables, 2 algorithms)

Figures (3)

  • Figure 1: The hybrid pooling approach using the Relevance Context Learning framework. The process begins with a shallow depth-$k$ pool where human assessors provide initial relevance judgements. These judgements are fed into the Instructor LLM, which analyzes the query--document pairs to synthesize explicit Relevance Narratives. These narratives serve as structured prompts for the Assessor LLM, which labels the remaining documents in the pool. The final Hybrid Pooled Data combines human and LLM-generated labels to create a cost-effective, high-quality IR evaluation dataset.
  • Figure 2: Per-query differences in F1 scores across ten executions of the ICL 1-shot setting on TREC-DL 2019. Each boxplot represents the distribution of pairwise F1 differences for a given query, obtained by randomly sampling one document as an in-context example in each execution.
  • Figure 3: Schematic overview of narrative generation. The Instructor LLM processes the input query and a set of judged documents to synthesize a high-level Relevance Narrative and granular Judging Instructions. The example displays the output for TREC DL-19 query 1115776 ("what is an aml surveillance analyst") when including only relevant documents.
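
The per-query quantity plotted in Figure 2 (pairwise F1 differences across repeated executions) can be illustrated with a short sketch. It assumes the differences are taken as absolute values over all unordered pairs of executions, which the caption does not state; the scores below are made up and are not results from the paper.

```python
from itertools import combinations
from typing import List


def pairwise_abs_differences(scores: List[float]) -> List[float]:
    """Absolute difference for every unordered pair of per-execution scores."""
    return [abs(a - b) for a, b in combinations(scores, 2)]


# Made-up F1 values for one query over three executions (illustration only).
f1_runs = [0.50, 0.75, 0.25]
diffs = pairwise_abs_differences(f1_runs)
print(diffs)
```

Under this assumption, ten executions yield 45 pairwise differences per query, and a wide spread for a query indicates that its F1 depends heavily on which single document was sampled as the in-context example.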