Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing

Le Yan, Zhen Qin, Honglei Zhuang, Rolf Jagerman, Xuanhui Wang, Michael Bendersky, Harrie Oosterhuis

TL;DR

The paper tackles the mismatch between the relevance labels an LLM generates and the rankings it produces through pairwise ranking prompting (PRP) in search tasks. It introduces a zero-shot post-processing approach that uses constrained regression to minimally perturb the LLM relevance scores so that they respect PRP-derived ranking constraints, complemented by efficiency strategies (SlideWin, TopAll) and a ranking-aware pseudo-rater pipeline. Evaluations on BEIR-derived and public ranking datasets show that the method achieves competitive NDCG@10 while keeping calibration error (ECE) and MSE low, often surpassing baselines and supervised variants. The approach provides a practical, scalable way to unify the ranking and labeling capabilities of LLMs for search applications.

Abstract

The powerful generative abilities of large language models (LLMs) show potential in generating relevance labels for search applications. Previous work has found that directly asking about relevancy, such as "How relevant is document A to query Q?", results in sub-optimal ranking. Instead, the pairwise ranking prompting (PRP) approach produces promising ranking performance through asking about pairwise comparisons, e.g., "Is document A more relevant than document B to query Q?". Thus, while LLMs are effective at ranking, this is not reflected in their relevance label generation. In this work, we propose a post-processing method to consolidate the relevance labels generated by an LLM with its powerful ranking abilities. Our method takes both LLM-generated relevance labels and pairwise preferences. The labels are then altered to satisfy the pairwise preferences of the LLM, while staying as close to the original values as possible. Our experimental results indicate that our approach effectively balances label accuracy and ranking performance. Thereby, our work shows it is possible to combine both the ranking and labeling abilities of LLMs through post-processing.
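
The consolidation idea in the abstract (alter the labels as little as possible so that they agree with the pairwise preferences) can be illustrated with a minimal sketch. It makes a simplifying assumption that is not taken from the paper: the pairwise preferences are taken to form one consistent total order over the documents. Under that assumption the least-squares problem reduces to isotonic regression, which the pool-adjacent-violators procedure solves exactly. The function name `consolidate` and its arguments are hypothetical, and the paper's own constrained-regression formulation may differ.

```python
def consolidate(scores, ranking):
    """Least-squares adjustment of `scores` so they are non-increasing along `ranking`.

    scores:  list of floats, scores[d] is the LLM relevance score of document d.
    ranking: list of document indices, most preferred first (assumed to be a
             consistent total order derived from pairwise preferences).
    Returns a new list of adjusted scores, indexed like `scores`.
    """
    # Pool Adjacent Violators: each block stores [sum of scores, number of documents].
    blocks = []
    for doc in ranking:
        blocks.append([scores[doc], 1])
        # A block whose mean exceeds the mean of the block ranked above it
        # violates the ordering, so the two are pooled into one block.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Every document in a block receives the block mean.
    adjusted = [0.0] * len(scores)
    position = 0
    for total, count in blocks:
        mean = total / count
        for doc in ranking[position:position + count]:
            adjusted[doc] = mean
        position += count
    return adjusted


if __name__ == "__main__":
    # Document 0 is preferred over 1, which is preferred over 2,
    # but the raw scores of documents 0 and 1 disagree with that order.
    print(consolidate([0.25, 0.75, 0.125], [0, 1, 2]))
```

Scores that already agree with the ranking are returned unchanged, and conflicting scores are pooled to their mean, so the adjusted scores satisfy the order with ties rather than strict inequalities.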

Paper Structure

This paper contains 28 sections, 9 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Left: Example of PRP scores not calibrated over different queries. Right: Illustration of the ranking-aware pseudo-rater pipeline that generates ranking-aware ratings with LLMs from the input query and list of candidate documents.
  • Figure 2: Illustration of how LLM pairwise constraints are selected in the SlideWin and TopAll methods. Top: The SlideWin method with window size 2 and stride 1 takes $O(kn)$ successive pair comparisons, illustrated by paired arrows, to sort the top-$k$ results from some initial ranking. Bottom: The TopAll method considers the top-$k$ results from an initial ranking and their pairwise constraints with all other results, shown by $O(kn)$ double-headed arrows.
  • Figure 3: Trade-off plots of ECE versus NDCG@10 on five ranking datasets. Higher NDCG@10 is better and lower ECE is better. Overall better methods are in the top-right corner of the plots. Lines correspond to the Pareto fronts of the Ensemble of PRater and PRP obtained by tuning the weight $w$ in Eq. \ref{eq:ensemble}. Our consolidation methods from Table \ref{tbl:result} are shown as scattered points.
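
The two constraint-selection schemes described in the Figure 2 caption can be sketched roughly as follows. This is an illustration based only on the caption, not the authors' implementation: the function names, the `prefers(a, b)` callback (a stand-in for one LLM pairwise comparison), and the choice to slide the window from the bottom of the list upward are all assumptions.

```python
def topall_pairs(initial_ranking, k):
    """Pairs between each of the top-k documents and every other document.

    Each unordered pair that contains at least one top-k document is listed once.
    """
    pairs = []
    for position, doc in enumerate(initial_ranking[:k]):
        for other in initial_ranking[position + 1:]:
            pairs.append((doc, other))
    return pairs


def slidewin(initial_ranking, k, prefers):
    """Sort the top-k documents with k sliding-window passes (window 2, stride 1).

    prefers(a, b) is a hypothetical comparator that returns True when
    document a should rank above document b.
    Returns the re-ordered list and the pairs that were compared.
    """
    ranking = list(initial_ranking)
    compared = []
    n = len(ranking)
    for done in range(min(k, n - 1)):
        # Slide from the bottom of the list up to the first unsorted position,
        # so the best remaining document bubbles up to index `done`.
        for i in range(n - 1, done, -1):
            upper, lower = ranking[i - 1], ranking[i]
            compared.append((upper, lower))
            if prefers(lower, upper):
                ranking[i - 1], ranking[i] = lower, upper
    return ranking, compared


if __name__ == "__main__":
    # Toy comparator for illustration only: a smaller id means more relevant.
    order, seen = slidewin([3, 1, 2, 0], k=2, prefers=lambda a, b: a < b)
    print(order[:2], len(seen))
    print(len(topall_pairs([3, 1, 2, 0, 4], k=2)))
```

Both sketches issue on the order of $kn$ comparisons for $n$ candidates, which matches the counts stated in the caption.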