Formalized Information Needs Improve Large-Language-Model Relevance Judgments

Jüri Keller, Maik Fröbe, Björn Engelmann, Fabian Haak, Timo Breuer, Birger Larsen, Philipp Schaer

Abstract

Cranfield-style retrieval evaluations with too few or too many relevant documents, or with low inter-assessor agreement on relevance, can reduce the reliability of observations. In evaluations with human assessors, information needs are often formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing reliability. To study whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure of previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on Robust04 and the 2019/2020 editions of TREC Deep Learning. We find that assessors without formalization judge many more documents relevant and show lower agreement, leading to reduced reliability in retrieval evaluations. Furthermore, we show that the formalized topics improve agreement between human and LLM relevance judgments, even when the topics are not highly similar to their human counterparts. Our findings indicate that LLM relevance assessors should use formalized information needs, as is standard for human assessment, and should synthetically formalize topics when no human formalization exists, to improve evaluation reliability.
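
The core step described above, expanding a bare query into a topic with a description and a narrative before judging, can be pictured with a short sketch. The Python snippet below is a minimal illustration under assumed prompt wording; the `build_formalization_prompt` text, the `call_llm` hook, and the example query are placeholders for illustration, not the paper's actual prompts (those follow the structure shown in Figure 1).

```python
# Minimal sketch of query-to-topic formalization as described in the abstract.
# The prompt text, the call_llm() hook, and the example query are assumptions
# for illustration only; the paper's actual prompts are outlined in Figure 1.

def build_formalization_prompt(query: str) -> str:
    """Ask an LLM to expand a bare query into TREC-style topic fields."""
    return (
        "You prepare a TREC-style retrieval topic for the query below.\n"
        f"Query: {query}\n\n"
        "Write two fields:\n"
        "Description: one sentence stating the underlying information need.\n"
        "Narrative: a short paragraph specifying which documents are relevant "
        "and which are not.\n"
    )


def call_llm(prompt: str) -> str:
    """Hypothetical hook: wire this to the LLM of your choice."""
    # Canned answer so the sketch runs without an API key.
    return (
        "Description: The user wants articles about measures cities take to "
        "reduce inner-city traffic congestion.\n"
        "Narrative: Relevant documents describe concrete measures such as "
        "congestion charges or transit expansion. Documents mentioning "
        "congestion only in passing are not relevant."
    )


def formalize(query: str) -> dict:
    """Parse the LLM output into the topic fields used alongside the query."""
    raw = call_llm(build_formalization_prompt(query))
    topic = {"query": query, "description": "", "narrative": ""}
    field = None
    for line in raw.splitlines():
        lowered = line.lower()
        if lowered.startswith("description:"):
            field = "description"
            topic[field] = line.split(":", 1)[1].strip()
        elif lowered.startswith("narrative:"):
            field = "narrative"
            topic[field] = line.split(":", 1)[1].strip()
        elif field:
            # Continuation lines belong to the most recent field.
            topic[field] += " " + line.strip()
    return topic


if __name__ == "__main__":
    print(formalize("reducing traffic congestion"))
```

The resulting description and narrative would then be passed to the LLM assessor alongside the query, in place of the query-only prompt.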

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: General form of the prompts used to synthesize information needs. Italicised words are placeholders, filled with appropriate contexts. Shaded text is optional and included in some prompt variants, as indicated by the color coding.
  • Figure 2: Comparison of the label alignment between TREC and LLM judgments that rely on synthesized topics. Only prompts that include the query are considered in this plot (see the agreement sketch after this list).
  • Figure 3: Label agreement distribution of TREC and LLM judgments that use synthetic topics per context level and across all prompts for R04.
  • Figure 4: BERTScore similarity between topics synthesized by gpt-oss-120b and the R04 topics by prompt, compared across the whole topic and for single components.
  • Figure 5: Relative length of the topics synthesized by gpt-oss-120b compared to the R04 reference topics.
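
Figures 2 and 3 compare the labels assigned by TREC assessors and by LLM assessors. As a rough illustration of how such agreement can be computed, the sketch below reads two qrels files, binarizes the labels, and reports Cohen's kappa via scikit-learn; the file names, the assumed four-column qrels format, and the choice of kappa are assumptions, not the paper's evaluation code.

```python
# Illustrative agreement computation between human (TREC) and LLM qrels.
# Assumes standard four-column qrels lines: topic, iteration, docid, label.
from sklearn.metrics import cohen_kappa_score


def load_qrels(path: str) -> dict:
    """Read a qrels file into {(topic, docid): graded label}."""
    qrels = {}
    with open(path) as handle:
        for line in handle:
            topic, _, docid, label = line.split()
            qrels[(topic, docid)] = int(label)
    return qrels


def label_agreement(human_path: str, llm_path: str) -> float:
    """Cohen's kappa on the (topic, docid) pairs judged by both assessors."""
    human = load_qrels(human_path)
    llm = load_qrels(llm_path)
    shared = sorted(set(human) & set(llm))
    # Binarize: any positive grade counts as relevant.
    y_human = [int(human[key] > 0) for key in shared]
    y_llm = [int(llm[key] > 0) for key in shared]
    return cohen_kappa_score(y_human, y_llm)


if __name__ == "__main__":
    # Hypothetical file names for a Robust04-style comparison.
    print(f"kappa = {label_agreement('robust04-human.qrels', 'robust04-llm.qrels'):.3f}")
```

Cohen's kappa is only one common agreement statistic; the paper may report a different measure, so this should be read as a sketch of the comparison setup rather than a reproduction of its results.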