RewardRank: Optimizing True Learning-to-Rank Utility
Gaurav Bhatt, Kiran Koshy Thekumparampil, Tanmay Gangwani, Tesi Xiao, Leonid Sigal
TL;DR
RewardRank reframes learning-to-rank as direct optimization of counterfactual user utility by learning a permutation-aware reward model from logged interactions and differentiably maximizing expected reward through SoftSort-based ranking. It introduces two scalable evaluation protocols, PO-Eval and LAU-Eval, to assess counterfactual performance without online experiments, revealing substantial gaps between offline proxies like NDCG and true utility. Empirical results show RewardRank achieves the highest counterfactual utility across both benchmarks and sets a new state of the art on offline relevance with real Baidu-ULTR clicks, all while avoiding explicit position-bias assumptions. The approach demonstrates that data-driven, end-to-end optimization of permutation-level utility yields practical gains in ranking tasks with complex user behavior, diversity, and biases.
Abstract
Traditional ranking systems optimize offline proxy objectives that rely on oversimplified assumptions about user behavior, often neglecting factors such as position bias and item diversity. Consequently, these models fail to improve true counterfactual utilities, such as click-through rate or purchase probability, when evaluated in online A/B tests. We introduce RewardRank, a data-driven learning-to-rank (LTR) framework for counterfactual utility maximization. RewardRank first learns a reward model that predicts the utility of any ranking directly from logged user interactions, and then trains a ranker to maximize this reward using a differentiable soft permutation operator. To enable rigorous and reproducible evaluation, we further propose two benchmark suites: (i) Parametric Oracle Evaluation (PO-Eval), which employs an open-source click model as a counterfactual oracle on the Baidu-ULTR dataset, and (ii) LLM-as-User Evaluation (LAU-Eval), which simulates realistic user behavior via large language models on the Amazon-KDD-Cup dataset. RewardRank achieves the highest counterfactual utility across both benchmarks and demonstrates that optimizing classical metrics such as NDCG is sub-optimal for maximizing true user utility. Finally, using real user feedback from the Baidu-ULTR dataset, RewardRank establishes a new state of the art in offline relevance performance. Overall, our results show that learning-to-rank can be reformulated as direct optimization of counterfactual utility, achieved in a purely data-driven manner without relying on explicit modeling assumptions such as position bias. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
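
To make the "differentiable soft permutation operator" concrete, the following is a minimal NumPy sketch of the SoftSort relaxation: a row-wise softmax over the negative absolute differences between the sorted scores and the original scores. It is an illustration under stated assumptions, not the authors' implementation. In particular, `toy_reward`, the item values, and the position weights are hypothetical stand-ins for the learned reward model, and a real training loop would use an autodiff framework to backpropagate through the relaxation.

```python
import numpy as np


def softsort(scores, tau=1.0):
    """Relaxed permutation matrix for a descending sort of `scores`.

    Row i is a probability distribution over items for rank i.
    As tau approaches 0, each row approaches a one-hot vector.
    """
    s = np.asarray(scores, dtype=float)
    sorted_s = np.sort(s)[::-1]  # scores in descending order
    # Pairwise distances between each sorted value and every original score.
    logits = -np.abs(sorted_s[:, None] - s[None, :]) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def toy_reward(perm_matrix, item_values, position_weights):
    """Hypothetical stand-in for a learned reward model (assumption).

    Scores a (soft) ranking as the position-weighted sum of the
    expected item value placed at each rank.
    """
    values_by_rank = perm_matrix @ np.asarray(item_values, dtype=float)
    return float(np.asarray(position_weights, dtype=float) @ values_by_rank)


# Illustrative inputs only; none of these numbers come from the paper.
ranker_scores = [0.1, 2.0, 1.0]
item_values = [0.2, 0.9, 0.5]
position_weights = [1.0, 0.5, 0.25]

soft_perm = softsort(ranker_scores, tau=0.01)
print(soft_perm.argmax(axis=1))
print(toy_reward(soft_perm, item_values, position_weights))
```

Because every entry of the relaxed matrix depends smoothly on the ranker's scores, a reward computed from it can be increased by gradient ascent on those scores, which is the mechanism the abstract describes. The temperature `tau` trades off fidelity to the hard ranking against gradient smoothness.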
