Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels

Honglei Zhuang, Zhen Qin, Kai Hui, Junru Wu, Le Yan, Xuanhui Wang, Michael Bendersky

TL;DR

This work identifies the limitations of binary Yes/No prompts for zero-shot LLM rankers and introduces fine-grained relevance labels and rating-scale prompts to better discriminate document relevance. It formalizes two scoring schemes, Expected Relevance (ER) and Peak Relevance Likelihood (PR), each of which turns the likelihoods the LLM assigns to the relevance labels into a single ranking score per document. Across eight BEIR data sets, fine-grained prompts improve NDCG@10 over binary prompts, with a small set of labels (3–4) working best and likelihood-based scoring outperforming direct label generation. The approach shows strong potential for improving zero-shot ranking in information retrieval and may apply to related tasks; the work also identifies practical limitations and directions for future work in calibration and label design.

Abstract

Zero-shot text rankers powered by recent LLMs achieve remarkable ranking performance by simply prompting. Existing prompts for pointwise LLM rankers mostly ask the model to choose from binary relevance labels like "Yes" and "No". However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query. We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers, enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking. We study two variants of the prompt template, coupled with different numbers of relevance levels. Our experiments on 8 BEIR data sets show that adding fine-grained relevance labels significantly improves the performance of LLM rankers.
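
The two scoring schemes named in the TL;DR can be illustrated with a minimal sketch. It assumes that the LLM has already been queried for a log-likelihood of each relevance label given a query-document pair. It also assumes that ER is the probability-weighted average of the label values after a softmax over those log-likelihoods, and that PR is the log-likelihood of the highest relevance label. These definitions reflect our reading of the method rather than anything spelled out in this summary, and the label set, relevance values, document ids, and log-likelihood numbers below are all hypothetical.

```python
import math


def expected_relevance(log_likelihoods, relevance_values):
    """ER (assumed form): softmax the label log-likelihoods, then take the
    probability-weighted average of the relevance values."""
    m = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in log_likelihoods]
    total = sum(exps)
    return sum((e / total) * y for e, y in zip(exps, relevance_values))


def peak_relevance(log_likelihoods, relevance_values):
    """PR (assumed form): the log-likelihood of the label that carries the
    highest relevance value."""
    peak = max(range(len(relevance_values)), key=lambda k: relevance_values[k])
    return log_likelihoods[peak]


def rank(doc_scores):
    """Return document ids sorted by descending score."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)


# Hypothetical 3-level label set with relevance values 0, 1, 2, e.g.
# ("Not Relevant", "Somewhat Relevant", "Highly Relevant").
values = [0, 1, 2]

# Hypothetical per-label log-likelihoods for three candidate documents.
docs = {
    "d1": [-0.2, -2.0, -3.5],
    "d2": [-2.5, -0.4, -1.6],
    "d3": [-3.0, -1.2, -0.5],
}

er_scores = {d: expected_relevance(s, values) for d, s in docs.items()}
pr_scores = {d: peak_relevance(s, values) for d, s in docs.items()}

print("ER ranking:", rank(er_scores))
print("PR ranking:", rank(pr_scores))
```

In this toy input both schemes order the documents the same way. They can disagree when a document spreads probability mass across intermediate labels, which is the case the fine-grained labels are meant to capture.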

Paper Structure (35 sections, 4 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Illustration of different prompting strategies for relevance generation LLM rankers.
  • Figure 2: Average NDCG@10 on 8 BEIR data sets with different $k$ in rating scale $0$-to-$k$.
  • Figure 3: Comparing ranking score distribution of different methods on the Covid data set.
  • Figure 4: Comparing rating scale relevance generation with different prompts.
  • Figure 5: Distribution of marginal probability $p_k$ of each relevance label in RG-S$(0,4)$ for query-document pairs with different ground-truth labels on the Covid data set.
  • ...and 2 more figures