Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs

Puxuan Yu; Daniel Cohen; Hemank Lamba; Joel Tetreault; Alex Jaimes

TL;DR

The paper tackles scale calibration for neural text rankers by leveraging natural language explanations (NLEs) generated by large language models (LLMs). It recasts the ranking of query-document pairs as the scoring of their NLEs, using a two-part model: a frozen LLM $g_{\Psi}$ that produces an explanation $e^q$ for each query-document pair, and a trainable ranker $f_{\Theta}$ that scores those explanations, i.e., $\phi_{\Phi}(q,\{d^q\}) = f_{\Theta}(g_{\Psi}(q,\{d^q\})) = f_{\Theta}(\{e^q\})$. The method explores literal and conditional prompting for NLEs, employs Monte Carlo sampling to form meta-NLEs, and aggregates multiple NLEs to encode uncertainty. Experiments on TREC and NTCIR show consistent improvements over strong baselines in calibration (CB-ECE, ECE, MSE), ranking (nDCG, nDCG@10), and downstream query performance prediction (QPP), with reproducible results using Llama2-13B-Chat and BERT-based rankers. The approach offers a practical path to usable, interpretable ranking scores from large neural rankers, while acknowledging latency and bias concerns that invite future work on efficiency and robustness.
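The composition $f_{\Theta}(g_{\Psi}(q,\{d^q\}))$ can be illustrated with a minimal sketch. Everything named here is hypothetical: `toy_generate` stands in for the frozen LLM, `toy_score` stands in for the trained ranker, and forming the meta-NLE by plain concatenation is an assumption of this sketch rather than the paper's stated aggregation procedure.

```python
import random
from typing import Callable, List, Tuple

Generator = Callable[[str, str, random.Random], str]
Scorer = Callable[[str], float]


def sample_nles(generate: Generator, query: str, doc: str,
                n_samples: int, seed: int = 0) -> List[str]:
    """Draw n_samples explanations for one query-document pair (Monte Carlo sampling)."""
    rng = random.Random(seed)
    return [generate(query, doc, rng) for _ in range(n_samples)]


def aggregate_nles(nles: List[str]) -> str:
    """Form one meta-NLE. Plain concatenation is an assumption of this sketch."""
    return " ".join(nles)


def rank_by_nle(query: str, docs: List[str], generate: Generator,
                score: Scorer, n_samples: int = 4) -> List[Tuple[str, float]]:
    """Score each document through its meta-NLE, then sort by descending score."""
    scored = []
    for doc in docs:
        meta_nle = aggregate_nles(sample_nles(generate, query, doc, n_samples))
        scored.append((doc, score(meta_nle)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_generate(query: str, doc: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the frozen LLM: a noisy verdict based on term overlap."""
    overlap = set(query.lower().split()) & set(doc.lower().split())
    p_relevant = 0.9 if overlap else 0.1
    if rng.random() < p_relevant:
        return "The document is relevant."
    return "The document is not relevant."


def toy_score(meta_nle: str) -> float:
    """Hypothetical stand-in for the trained ranker: share of positive explanations."""
    n_sentences = meta_nle.count(".")
    n_negative = meta_nle.count("not relevant")
    return (n_sentences - n_negative) / n_sentences


if __name__ == "__main__":
    docs = ["cooking pasta", "neural ranking models"]
    for doc, s in rank_by_nle("neural ranking", docs, toy_generate, toy_score, n_samples=8):
        print(f"{s:.2f}  {doc}")
```

In this toy setup, disagreement among the sampled explanations lowers the score, which is one simple way that multiple NLEs could carry an uncertainty signal to the ranker.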

Abstract

In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low complexity learning-to-rank models, the larger data demands and parameter count specific to modern neural text rankers produce unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting large language models (LLMs) to provide relevance and uncertainty signals for these neural text rankers to produce scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods for ranking, calibration, and query performance prediction tasks.

Paper Structure (24 sections, 5 equations, 3 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Methodology
Problem Statement and Motivation
Scale Calibration via Natural Language Explanations
Acquiring NLEs via LLM Prompting
Literal Explanation
Conditional Explanation
Aggregating Multiple NLEs
Experiments
Data
Metrics
Baselines
Downstream Performance: QPP
Reproducibility
...and 9 more sections

Figures (3)

Figure 1: The key idea of this study: Neural ranking models struggle to produce meaningful ranking scores when encountering complex query-document pairs. We investigate the integration of natural language explanations as inputs to neural rankers, aiming to simplify the scale-calibrated ranking task for these rankers.
Figure 2: Ranking and scale calibration performance on TREC of full calibration of BERT, taking query + document inputs (FC BERT) and our proposed explanations, using four different optimization objectives. NLE-based approaches consistently yield better ranking (left) and calibration (right) performance.
Figure 3: Reliability diagrams for two models on TREC: The left diagram shows a model with ranking scores densely concentrated on the lower part of the scale, which exhibits better ECE performance due to ECE's failure to account for prediction coverage across the target scale. On the right, the CB-ECE penalizes this undesirable behavior, indicating that the model providing better coverage across the scale is more effectively calibrated.
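The contrast described in the Figure 3 caption can be reproduced on toy data. The sketch below assumes a binned ECE (size-weighted gap between mean prediction and mean label per bin) and assumes that CB-ECE is the unweighted average of that ECE computed separately for each relevance label. Both definitions, the 0 to 3 label scale, and the example numbers are assumptions for illustration, not values from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def ece(preds: Sequence[float], labels: Sequence[float],
        n_bins: int, lo: float, hi: float) -> float:
    """Binned calibration error: size-weighted |mean prediction - mean label| per bin."""
    if not preds:
        return 0.0
    width = (hi - lo) / n_bins
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = int((p - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp scores on or beyond the scale edges
        bins[idx].append((p, y))
    total = len(preds)
    error = 0.0
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            mean_label = sum(y for _, y in members) / len(members)
            error += len(members) / total * abs(mean_pred - mean_label)
    return error


def cb_ece(preds: Sequence[float], labels: Sequence[float],
           n_bins: int, lo: float, hi: float) -> float:
    """Assumed class-balanced variant: unweighted mean of ECE over label classes."""
    by_class: Dict[float, Tuple[List[float], List[float]]] = {}
    for p, y in zip(preds, labels):
        class_preds, class_labels = by_class.setdefault(y, ([], []))
        class_preds.append(p)
        class_labels.append(y)
    per_class = [ece(ps, ys, n_bins, lo, hi) for ps, ys in by_class.values()]
    return sum(per_class) / len(per_class)


# Toy data: 8 non-relevant (label 0) and 2 highly relevant (label 3) documents.
labels = [0.0] * 8 + [3.0] * 2
preds_low = [0.6] * 10                  # every score sits low on the scale
preds_spread = [0.5] * 8 + [2.5] * 2    # scores cover the scale

if __name__ == "__main__":
    for name, preds in [("low", preds_low), ("spread", preds_spread)]:
        print(name, ece(preds, labels, 3, 0.0, 3.0), cb_ece(preds, labels, 3, 0.0, 3.0))
```

On this toy data the concentrated model obtains the lower ECE because its single score happens to match the mean label, while the class-balanced variant favors the model whose scores cover the scale.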
