Hybrid Pooling with LLMs via Relevance Context Learning
David Otero, Javier Parapar
TL;DR
This work tackles the high cost of manual relevance judgments for IR test collections by introducing Relevance Context Learning (RCL) and a hybrid pooling strategy. RCL uses an Instructor LLM to produce topic-specific relevance narratives from human judgments, then guides a separate Assessor LLM with these narratives to judge new query-document pairs, enabling efficient, interpretable relevance assessment. The hybrid pooling approach splits the pool into a shallow, human-judged portion and a deeper, LLM-judged portion, improving scalability while maintaining quality. Empirical results on DL-19, DL-20, and TREC-8 show that RCL matches or surpasses standard In-Context Learning, with pronounced gains on long documents and reduced input costs, signaling a shift toward explicit relevance-context modeling for robust, cost-effective IR dataset construction.
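As an illustration of the hybrid pooling split described above, the following minimal sketch partitions one topic's pool into a human-judged part and an LLM-judged part. The function name `hybrid_pool`, the parameters `human_depth` and `total_depth`, and the toy runs are assumptions made for this example; in particular, the summary does not state how deep the LLM-judged portion extends, so `total_depth` is a hypothetical cutoff.

```python
from typing import Dict, List, Set, Tuple


def hybrid_pool(
    runs: Dict[str, List[str]],
    human_depth: int,
    total_depth: int,
) -> Tuple[Set[str], Set[str]]:
    """Split one topic's pool into a human-judged and an LLM-judged part.

    runs maps a system name to its ranked list of document ids for the topic.
    human_depth is the shallow pool depth sent to human assessors.
    total_depth is an assumed overall cutoff for the pool.
    """
    if not 0 <= human_depth <= total_depth:
        raise ValueError("expected 0 <= human_depth <= total_depth")

    human_pool: Set[str] = set()
    full_pool: Set[str] = set()
    for ranking in runs.values():
        # Shallow pool: union of each system's top human_depth documents.
        human_pool.update(ranking[:human_depth])
        # Full pool: union of each system's top total_depth documents.
        full_pool.update(ranking[:total_depth])

    # Everything pooled but not judged by humans goes to the LLM.
    llm_pool = full_pool - human_pool
    return human_pool, llm_pool


# Toy runs from two hypothetical systems for a single topic.
runs = {
    "sys_a": ["d1", "d2", "d3", "d4"],
    "sys_b": ["d2", "d5", "d1", "d6"],
}
human_pool, llm_pool = hybrid_pool(runs, human_depth=2, total_depth=4)
```

With these toy runs, the documents ranked in the top 2 by either system go to human assessors, and the remaining pooled documents are left for the LLM.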
Abstract
High-quality relevance judgements over large query sets are essential for evaluating Information Retrieval (IR) systems, yet manual annotation remains costly and time-consuming. Large Language Models (LLMs) have recently shown promise as automatic relevance assessors, but their reliability is still limited. Most existing approaches rely on zero-shot prompting or In-Context Learning (ICL) with a small number of labeled examples. However, standard ICL treats examples as independent instances and fails to explicitly capture the underlying relevance criteria of a topic, restricting its ability to generalize to unseen query-document pairs. To address this limitation, we introduce Relevance Context Learning (RCL), a novel framework that leverages human relevance judgements to explicitly model topic-specific relevance criteria. Rather than directly using labeled examples for in-context prediction, RCL first prompts an LLM (Instructor LLM) to analyze sets of judged query-document pairs and generate explicit narratives that describe what constitutes relevance for a given topic. These relevance narratives are then used as structured prompts to guide a second LLM (Assessor LLM) in producing relevance judgements. To evaluate RCL in a realistic data collection setting, we propose a hybrid pooling strategy in which a shallow depth-k pool from participating systems is judged by human assessors, while the remaining documents are labeled by LLMs. Experimental results demonstrate that RCL substantially outperforms zero-shot prompting and consistently improves over standard ICL. Overall, our findings indicate that transforming relevance examples into explicit, context-aware relevance narratives is a more effective way of exploiting human judgements for LLM-based IR dataset construction.
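The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompt wording, the integer label format, and the stub models are all hypothetical, and a real setup would replace the stubs with calls to actual models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model is any function from prompt text to reply text.
LLM = Callable[[str], str]


def build_instructor_prompt(query: str, judged: List[Tuple[str, int]]) -> str:
    """Ask for a narrative explaining what makes a document relevant to the topic."""
    lines = [f"Query: {query}", "Human-judged documents:"]
    for text, label in judged:
        lines.append(f"- [label={label}] {text}")
    lines.append("Describe the criteria that separate relevant from non-relevant documents.")
    return "\n".join(lines)


def build_assessor_prompt(query: str, narrative: str, document: str) -> str:
    """Ask for a judgement of one document, guided by the relevance narrative."""
    return "\n".join([
        f"Query: {query}",
        f"Relevance narrative: {narrative}",
        f"Document: {document}",
        "Answer with a single integer relevance label.",
    ])


def rcl_judge(
    query: str,
    judged: List[Tuple[str, int]],
    unjudged: Dict[str, str],
    instructor: LLM,
    assessor: LLM,
) -> Dict[str, int]:
    """Generate one narrative for the topic, then label each unjudged document."""
    narrative = instructor(build_instructor_prompt(query, judged))
    labels: Dict[str, int] = {}
    for doc_id, text in unjudged.items():
        reply = assessor(build_assessor_prompt(query, narrative, text))
        # Assumes the reply is a bare integer; real model output needs sturdier parsing.
        labels[doc_id] = int(reply.strip())
    return labels


# Stub models so the sketch runs without any external service.
def stub_instructor(prompt: str) -> str:
    return "Relevant documents discuss the jaguar as an animal, not the car brand."


def stub_assessor(prompt: str) -> str:
    document_line = [l for l in prompt.splitlines() if l.startswith("Document: ")][0]
    return "1" if "cat" in document_line else "0"


labels = rcl_judge(
    query="jaguar habitat",
    judged=[("Jaguars live in rainforests.", 1), ("Jaguar unveils a new sedan.", 0)],
    unjudged={
        "d7": "The jaguar is a large cat native to the Americas.",
        "d8": "The Jaguar XK is a luxury car.",
    },
    instructor=stub_instructor,
    assessor=stub_assessor,
)
```

In this sketch the narrative is generated once per topic and reused for every unjudged document, so the labeled examples themselves never appear in the Assessor prompt.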
