Towards Two-Stage Counterfactual Learning to Rank
Shashank Gupta, Yiming Liao, Maarten de Rijke
TL;DR
This paper tackles the scalability challenge of counterfactual learning to rank (CLTR) in real-world settings by introducing a two-stage CLTR estimator that explicitly models the interaction between a candidate generator and a ranker. It then proposes a joint, alternating optimization procedure to train both stages offline, addressing distribution shifts that arise when switching candidate generators. The key contributions are the formal two-stage CLTR objective $\hat{U}(\pi_r, \pi_c)$, the accompanying gradient-based optimization with REINFORCE-style updates, and empirical evidence on a semi-synthetic MovieLens-1M setup showing superior performance over two-stage baselines. The work enables CLTR to scale to large candidate pools and provides a practical framework for jointly optimizing candidate generation and ranking in offline settings.
Abstract
Counterfactual learning to rank (CLTR) aims to learn a ranking policy from user interactions while correcting for the inherent biases in interaction data, such as position bias. Existing CLTR methods assume a single ranking policy that selects a top-K ranking from the entire document candidate set. In real-world applications, the candidate document set is on the order of millions, making a single-stage ranking policy impractical. To scale to millions of documents, real-world ranking systems are designed in a two-stage fashion, with a candidate generator followed by a ranker. The existing CLTR method for a two-stage offline ranking system considers only the top-1 ranking set-up and trains only the candidate generator, with the ranker fixed. A CLTR method for training both the ranker and the candidate generator jointly is missing from the existing literature. In this paper, we propose a two-stage CLTR estimator that considers the interaction between the two stages and estimates the joint value of the two policies offline. In addition, we propose a novel joint optimization method to train the candidate generator and ranker policies. To the best of our knowledge, we are the first to propose a CLTR estimator and learning method for two-stage ranking. Experimental results on a semi-synthetic benchmark demonstrate the effectiveness of the proposed joint CLTR method over baselines.
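To make the two-stage set-up concrete, the following is a minimal sketch of a candidate generator followed by a ranker, together with a generic position-based inverse propensity scoring (IPS) estimate of expected clicks for the resulting top-K ranking. It is not the paper's two-stage estimator: the function names, the deterministic score-based policies, and the per-position examination weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def generate_candidates(cand_scores: Dict[str, float], m: int) -> List[str]:
    """Stage 1 (hypothetical): keep the m documents with the highest cheap score."""
    return sorted(cand_scores, key=lambda d: cand_scores[d], reverse=True)[:m]


def rank_candidates(candidates: Sequence[str],
                    rank_scores: Dict[str, float],
                    k: int) -> List[str]:
    """Stage 2 (hypothetical): reorder only the surviving candidates, return top k."""
    return sorted(candidates, key=lambda d: rank_scores[d], reverse=True)[:k]


def ips_click_estimate(ranking: Sequence[str],
                       logged_clicks: Dict[str, int],
                       logged_propensity: Dict[str, float],
                       position_weight: Sequence[float]) -> float:
    """Generic position-based IPS estimate for one query (illustrative assumption).

    Each logged click is reweighted by the inverse of the examination
    probability the document had under the logging policy, then multiplied
    by the examination weight of its position in the new ranking.
    """
    total = 0.0
    for pos, doc in enumerate(ranking):
        click = logged_clicks.get(doc, 0)
        if click:
            total += position_weight[pos] * click / logged_propensity[doc]
    return total
```

For example, with candidate scores `{"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.7, "e": 0.2}` and `m = 3`, the first stage keeps `a`, `c`, and `d`; the ranker never sees `b` or `e`, however highly it would have scored them. This dependence of the ranker's input on the candidate generator is the interaction that a two-stage estimator has to account for.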
