RewardRank: Optimizing True Learning-to-Rank Utility

Gaurav Bhatt, Kiran Koshy Thekumparampil, Tanmay Gangwani, Tesi Xiao, Leonid Sigal

TL;DR

RewardRank reframes learning-to-rank as direct optimization of counterfactual user utility by learning a permutation-aware reward model from logged interactions and differentiably maximizing expected reward through SoftSort-based ranking. It introduces two scalable evaluation protocols, PO-Eval and LAU-Eval, to assess counterfactual performance without online experiments, revealing substantial gaps between offline proxies like NDCG and true utility. Empirical results show RewardRank achieves the highest counterfactual utility across both benchmarks and sets a new state of the art on offline relevance with real Baidu-ULTR clicks, all while avoiding explicit position-bias assumptions. The approach demonstrates that data-driven, end-to-end optimization of permutation-level utility yields practical gains in ranking tasks with complex user behavior, diversity, and biases.

Abstract

Traditional ranking systems optimize offline proxy objectives that rely on oversimplified assumptions about user behavior, often neglecting factors such as position bias and item diversity. Consequently, these models fail to improve true counterfactual utilities, such as click-through rate or purchase probability, when evaluated in online A/B tests. We introduce RewardRank, a data-driven learning-to-rank (LTR) framework for counterfactual utility maximization. RewardRank first learns a reward model that predicts the utility of any ranking directly from logged user interactions, and then trains a ranker to maximize this reward using a differentiable soft permutation operator. To enable rigorous and reproducible evaluation, we further propose two benchmark suites: (i) Parametric Oracle Evaluation (PO-Eval), which employs an open-source click model as a counterfactual oracle on the Baidu-ULTR dataset, and (ii) LLM-as-User Evaluation (LAU-Eval), which simulates realistic user behavior via large language models on the Amazon-KDD-Cup dataset. RewardRank achieves the highest counterfactual utility across both benchmarks and demonstrates that optimizing classical metrics such as NDCG is sub-optimal for maximizing true user utility. Finally, using real user feedback from the Baidu-ULTR dataset, RewardRank establishes a new state of the art in offline relevance performance. Overall, our results show that learning-to-rank can be reformulated as direct optimization of counterfactual utility, achieved in a purely data-driven manner without relying on explicit modeling assumptions such as position bias. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
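
To make the "differentiable soft permutation operator" concrete, the sketch below implements the standard SoftSort relaxation in plain Python: each row is a softmax over the negative distances between one sorted score and all raw scores. This is an illustration only, not the authors' code; the function names `softsort` and `soft_rank_embeddings`, the temperature value, and the toy scores and embeddings are all assumptions made for the example.

```python
import math


def softsort(scores, tau=1.0):
    """Row-stochastic relaxation of the descending-sort permutation matrix.

    Row i is a softmax over -|i-th largest score - score_j| / tau, so as
    tau shrinks, row i concentrates on the item that belongs at rank i.
    """
    sorted_scores = sorted(scores, reverse=True)
    matrix = []
    for target in sorted_scores:
        logits = [-abs(target - s) / tau for s in scores]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def soft_rank_embeddings(perm, embeddings):
    """Mix item embeddings with the soft permutation: row i approximates
    the embedding of the item placed at rank i."""
    dim = len(embeddings[0])
    return [
        [sum(p * emb[d] for p, emb in zip(row, embeddings)) for d in range(dim)]
        for row in perm
    ]


# Hypothetical ranker scores and one-hot item embeddings (assumptions).
scores = [0.2, 2.5, 1.0]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

P = softsort(scores, tau=0.05)
soft_items = soft_rank_embeddings(P, embeddings)
hard_order = [row.index(max(row)) for row in P]
```

At a low temperature the soft permutation is nearly hard, so `hard_order` recovers the usual descending sort of the scores. A larger `tau` spreads each row over several items, which is what lets a gradient of a downstream reward reach the ranker scores.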

Paper Structure

This paper contains 38 sections, 1 theorem, 24 equations, 8 figures, 7 tables.

Key Result

Theorem 1

Let $\mathbf{r} = (r_1, \ldots, r_n) \in \mathbb{R}_{\ge 0}^n$ be a vector of predicted relevance scores, and let $\mathbf{e} = (e_1, \ldots, e_n) \in \mathbb{R}_{\ge 0}^n$ be a non-increasing sequence of examination probabilities: $e_1 \ge e_2 \ge \ldots \ge e_n$. Let $\pi^*$ be the permutation tha

Figures (8)

  • Figure 1: Counterfactual ranking with true learning-to-rank utility. Three arrangements for the query "laptop bag", with item relevance/rating scores (0–5*). The top row ranks purely by relevance but suffers from similarity aversion due to identical color and style, lowering engagement. The middle row improves diversity but surfaces low-relevance items early, which may deter clicks. The bottom row balances diversity and relevance, placing distinct yet relevant items in top positions, leading to higher predicted utility and user engagement. (Figures are generated by GPT-4o)
  • Figure 2: RewardRank. A ranker scores the items in a query group. These scores are used to compute soft item embeddings via a soft permutation matrix. Position-encoded soft item embeddings are passed into a reward model to estimate the ranking's utility. Finally, the ranker is optimized to maximize the predicted utility.
  • Figure 3: Reward misspecification correction on PO-Eval. Each point represents a ranked list with true utility ($u$, estimated by IPS-Oracle) and predicted utility ($\hat{u}$, estimated by the utility model) from the ranker. Colors indicate $w = 1 - \lambda |u_{\text{logged}} - \hat{u}_{\text{logged}}|$, showing how increasing $\lambda$ down-weights overconfident or misaligned samples to emphasize well-calibrated predictions.
  • Figure 4: Distributions extracted from IPS-Oracle analysis on Baidu-ULTR.
  • Figure 5: Effect of Sampling Temperature on LLM-Simulated Behavior in LAU-Eval. We visualize the distribution of binary purchase decisions (top) and item positions (bottom) generated by Claude Sonnet 3.5 v2 under three sampling temperatures: 0.1, 0.5, and 0.75. Each sample corresponds to a ranked list generated during LAU-Eval. As temperature increases, the purchase signal slightly diversifies, while positional biases remain consistent across settings. These results suggest that LAU-Eval is robust to moderate sampling variability, with LLMs producing stable user-like behavior under soft prompting.
  • ...and 3 more figures
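
The weight in the Figure 3 caption, $w = 1 - \lambda |u_{\text{logged}} - \hat{u}_{\text{logged}}|$, can be computed directly. The sketch below only evaluates that formula on made-up numbers; the function name `correction_weight` and the sample utilities are hypothetical, and how the weight enters the training objective is not stated in this listing, so it is not modeled here.

```python
def correction_weight(u_logged, u_hat_logged, lam):
    """w = 1 - lam * |u_logged - u_hat_logged|, as in the Figure 3 caption."""
    return 1.0 - lam * abs(u_logged - u_hat_logged)


# Hypothetical (observed, predicted) utilities for three logged lists.
logged = [(1.0, 0.9), (0.0, 0.8), (1.0, 0.2)]

for lam in (0.0, 0.5, 1.0):
    weights = [correction_weight(u, u_hat, lam) for u, u_hat in logged]
    print(lam, [round(w, 2) for w in weights])
```

With $\lambda = 0$ every sample keeps weight 1. As $\lambda$ grows, samples whose predicted utility on the logged ranking disagrees with the observed utility receive smaller weights, which matches the down-weighting the caption describes.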

Theorems & Definitions (2)

  • Theorem 1: Ideal Ranking Maximizes Utility via Rearrangement Inequality
  • proof
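
The title of Theorem 1 invokes the rearrangement inequality. The statement is truncated in this listing, so the sketch below rests on an assumption: that the utility has the form $\sum_i e_i \, r_{\pi(i)}$, with $e$ non-increasing as in the theorem's setup. Under that assumption, a brute-force search over all permutations of a small, made-up instance picks the same order as sorting by relevance in descending order. The helper name `utility` and all numbers are hypothetical.

```python
from itertools import permutations


def utility(relevance, examination, order):
    """Sum over ranks i of e_i * r_{order[i]} (assumed utility form)."""
    return sum(e * relevance[j] for e, j in zip(examination, order))


relevance = [0.3, 0.9, 0.1, 0.6]     # hypothetical relevance scores r
examination = [1.0, 0.6, 0.35, 0.2]  # hypothetical non-increasing e

# Order obtained by sorting items by relevance, highest first.
sorted_order = tuple(sorted(range(len(relevance)), key=lambda j: -relevance[j]))

# Exhaustive search over all 4! = 24 orderings.
best_order = max(
    permutations(range(len(relevance))),
    key=lambda order: utility(relevance, examination, order),
)

print(sorted_order, best_order)
```

This is a numerical check on one instance, not a proof; the general argument is the rearrangement inequality named in the theorem title.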