ExaRanker: Explanation-Augmented Neural Ranker

Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira

TL;DR

ExaRanker tackles data efficiency in neural ranking by augmenting training data with natural-language explanations generated by LLMs. The method finetunes a seq2seq ranker to output a true/false label followed by an explanation and, at inference, uses only the probability of the first generated token as the ranking score, so generating the explanation is optional at query time. Empirically, explanation augmentation yields consistent gains across BEIR datasets in zero-shot settings, especially when labeled data is scarce, and lets a model trained on a fraction of the data approach the performance of models trained on much larger labeled datasets. Ablations indicate that the benefit comes from the explanations in the training signal, at no extra inference cost, and that reversing the generation order (explanation before label) harms performance. Overall, ExaRanker demonstrates that explanations can be a cost-effective way to distill reasoning from LLMs into practical IR rankers, with explanations available on demand.
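
The first-token scoring described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes, as in monoT5-style rankers, that the score is a softmax restricted to the logits of the two label tokens at the first decoding step. The summary only states that the first-token probability is used, and the function names and the `(doc_id, logit_true, logit_false)` layout are hypothetical.

```python
import math


def first_token_score(logit_true: float, logit_false: float) -> float:
    """Probability of the 'true' token under a softmax over the two label logits.

    Assumption: only the 'true' and 'false' logits at the first decoding
    step are compared; the rest of the vocabulary is ignored.
    """
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


def rerank(candidates):
    """Sort candidates by descending first-token score.

    `candidates` is a hypothetical list of (doc_id, logit_true, logit_false)
    tuples, one per query-passage pair, taken from the first decoder step.
    """
    scored = [(doc_id, first_token_score(lt, lf)) for doc_id, lt, lf in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because only the first decoding step is needed for the score, decoding can stop there during ranking and continue only when an explanation is requested.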

Abstract

Recent work has shown that inducing a large language model (LLM) to generate explanations prior to outputting an answer is an effective strategy to improve performance on a wide range of reasoning tasks. In this work, we show that neural rankers also benefit from explanations. We use LLMs such as GPT-3.5 to augment retrieval datasets with explanations and train a sequence-to-sequence ranking model to output a relevance label and an explanation for a given query-document pair. Our model, dubbed ExaRanker, finetuned on a few thousand examples with synthetic explanations performs on par with models finetuned on 3x more examples without explanations. Furthermore, the ExaRanker model incurs no additional computational cost during ranking and allows explanations to be requested on demand.
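
To make the training setup concrete, the sketch below shows one way a target string containing a relevance label and an explanation might be assembled for a query-document pair. The `Example` class, the `build_target` helper, and the exact "label. Explanation: ..." template are assumptions made for illustration; the abstract specifies only that the model outputs a relevance label and an explanation.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical container for one explanation-augmented training example."""
    query: str
    passage: str
    relevant: bool
    explanation: str  # synthetic explanation produced by an LLM


def build_target(example: Example, with_explanation: bool = True) -> str:
    """Build the seq2seq target string with the label placed first.

    Assumption: the label precedes the explanation, so the first generated
    token alone carries the relevance decision.
    """
    label = "true" if example.relevant else "false"
    if not with_explanation:
        return label  # baseline target: label only
    return f"{label}. Explanation: {example.explanation}"
```

Placing the label first keeps the ranking decision in the first generated token, which is what allows the explanation to be skipped at ranking time.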

Paper Structure

This paper contains 8 sections, 1 equation, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Method overview.
  • Figure 2: Prompt used to generate explanations for a query-passage-label triple (presented in Python's f-string notation).
  • Figure 3: Illustration of input and generated outputs of a relevant (green) and non-relevant (red) query-passage pair.
  • Figure 4: Average zero-shot results on 6 datasets of the BEIR benchmark when varying the number of training examples. monoT5-400k is finetuned on the 400k relevant query-passage pairs from MS MARCO without explanations.