Policy-Gradient Training of Language Models for Ranking

Ge Gao; Jonathan D. Chang; Claire Cardie; Kianté Brantley; Thorsten Joachims

TL;DR

This paper introduces Neural PG-RANK, a training algorithm that learns to rank by instantiating an LLM as a Plackett-Luce ranking policy. It relies little on complex heuristics and unifies the training objective with downstream decision-making quality.

Abstract

Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating an LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvements, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks.
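
For orientation, the two standard ingredients named in the abstract can be written out as follows. This is the textbook form of the Plackett-Luce distribution and of the REINFORCE (score-function) identity, not a transcription of the paper's equations. The symbols are notation chosen here and may differ from the paper's: q is a query, d_1, ..., d_n are candidate documents, s_\theta is the model's relevance score, r is a ranking (r(i) is the index of the document at position i), and U is the utility, for example a ranking metric.

```latex
% Plackett-Luce: build the ranking position by position, each time
% applying a softmax over the documents that have not been placed yet.
\pi_\theta(r \mid q) \;=\; \prod_{i=1}^{n}
  \frac{\exp s_\theta\!\left(q, d_{r(i)}\right)}
       {\sum_{j=i}^{n} \exp s_\theta\!\left(q, d_{r(j)}\right)}

% REINFORCE: the gradient of the expected utility is an expectation
% that can be estimated from rankings sampled from the policy.
\nabla_\theta \, \mathbb{E}_{r \sim \pi_\theta(\cdot \mid q)}\!\left[ U(r \mid q) \right]
  \;=\; \mathbb{E}_{r \sim \pi_\theta(\cdot \mid q)}\!\left[ U(r \mid q)\,
        \nabla_\theta \log \pi_\theta(r \mid q) \right]
```

Because the utility enters only as a scalar weight on the log-probability gradient, it does not need to be differentiable, which is what allows a ranking metric to serve directly as the training objective.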

Paper Structure (29 sections, 9 equations, 1 figure, 11 tables)

This paper contains 29 sections, 9 equations, 1 figure, 11 tables.

Introduction
Background and Related Work
Text Retrieval
Learning to Rank
Setting
Method
Plackett-Luce Ranking Policy
REINFORCE
Monte Carlo Sampling
Variance Reduction
Utility
Experimental Setup
Data
Evaluation Setup
Comparison System
...and 14 more sections

Figures (1)

Figure 1: Illustration of our Neural PG-RANK. Given a query and a collection of documents, a Plackett-Luce ranking policy samples rankings, receives a utility for each, and is updated by policy gradient using the received utility. Our method can directly optimize any ranking metric of interest as the utility, and allows end-to-end training of any differentiable policy. Query and document examples are from the MS MARCO dataset (Campos2016MSMA).
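
The loop described in the caption (sample rankings, score them, update with policy gradient) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is a bare vector of per-document scores rather than a language model, DCG stands in for "any ranking metric", the leave-one-out baseline is one common variance-reduction choice picked here for illustration, and all function names are hypothetical.

```python
import numpy as np

def sample_ranking(scores, rng):
    """Draw one ranking from a Plackett-Luce distribution.

    Uses the Gumbel-max trick: sorting scores perturbed by i.i.d. Gumbel
    noise in descending order yields an exact Plackett-Luce sample.
    """
    noise = rng.gumbel(size=scores.shape)
    return np.argsort(-(scores + noise))

def log_prob_and_grad(scores, ranking):
    """Log-probability of `ranking` and its gradient w.r.t. `scores`."""
    logp = 0.0
    grad = np.zeros(len(scores))
    for i in range(len(ranking)):
        remaining = ranking[i:]              # documents not yet placed
        s = scores[remaining]
        probs = np.exp(s - s.max())          # numerically stable softmax
        probs /= probs.sum()
        logp += np.log(probs[0])             # chosen document comes first
        grad[ranking[i]] += 1.0
        grad[remaining] -= probs
    return logp, grad

def dcg(ranking, relevance):
    """Discounted cumulative gain of a ranking (illustrative utility)."""
    discounts = 1.0 / np.log2(np.arange(2, len(ranking) + 2))
    return float(np.sum(relevance[ranking] * discounts))

def reinforce_gradient(scores, relevance, n_samples, rng):
    """Monte Carlo policy-gradient estimate of d E[utility] / d scores.

    Assumption: a leave-one-out baseline (mean utility of the *other*
    samples) is subtracted to reduce variance; requires n_samples >= 2.
    """
    rankings = [sample_ranking(scores, rng) for _ in range(n_samples)]
    utils = np.array([dcg(r, relevance) for r in rankings])
    baselines = (utils.sum() - utils) / (n_samples - 1)
    grad = np.zeros(len(scores))
    for r, u, b in zip(rankings, utils, baselines):
        grad += (u - b) * log_prob_and_grad(scores, r)[1]
    return grad / n_samples

# Usage: gradient ascent on the scores for one toy query.
rng = np.random.default_rng(0)
relevance = np.array([0.0, 0.0, 1.0, 0.0])   # only document 2 is relevant
scores = np.zeros(4)
for _ in range(300):
    scores += 0.5 * reinforce_gradient(scores, relevance, 16, rng)
```

In an end-to-end setting the score vector would be produced by a differentiable model, and the same weighted log-probability gradient would be backpropagated into its parameters by an autodiff framework instead of being computed by hand.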
