Learned Ranking Function: From Short-term Behavior Predictions to Long-term User Satisfaction

Yi Wu; Daryl Chang; Jennifer She; Zhe Zhao; Li Wei; Lukasz Heldt

TL;DR

The paper tackles the problem of optimizing long-term user satisfaction in slate-based recommendations by reframing ranking as slate optimization under a multi-objective MDP. It introduces the Learned Ranking Function (LRF), which uses a cascade click model and lift-based rewards to account for abandonment, and it employs a constrained optimization approach via dynamic linear scalarization to stabilize trade-offs across objectives. A practical on-policy Monte Carlo optimization framework trains separate user-item networks for abandonment, click, and lift signals, enabling inference that maximizes a learned $Q$-function. The approach is deployed on YouTube and validated through multiple live experiments, demonstrating improvements in long-term satisfaction and showcasing the value of lift formulations, cascade modeling, and offline-evaluation-guided weight adaptation for multi-objective slate optimization.
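As a rough illustration of the cascade click model with abandonment mentioned above, the sketch below assumes a user scans the slate top to bottom and, at each position, either clicks (probability `p_clk`), abandons (probability `p_abd`), or moves on to the next item. The function name `cascade_outcomes` and this exact three-way split are assumptions made for illustration and are not taken from the paper's implementation.

```python
from typing import List, Tuple


def cascade_outcomes(
    p_clk: List[float], p_abd: List[float]
) -> Tuple[List[float], List[float], float]:
    """Hypothetical helper: per-position outcome probabilities under a
    cascade model where each examined item is clicked, causes abandonment,
    or is skipped in favour of the next position.

    Returns (click_probs, abandon_probs, p_reach_end), where p_reach_end is
    the probability that the user scans the whole slate without clicking
    or abandoning.
    """
    if len(p_clk) != len(p_abd):
        raise ValueError("p_clk and p_abd must have the same length")

    reach = 1.0  # probability that the user examines the current position
    click_probs: List[float] = []
    abandon_probs: List[float] = []
    for c, a in zip(p_clk, p_abd):
        if c < 0.0 or a < 0.0 or c + a > 1.0:
            raise ValueError("need c >= 0, a >= 0 and c + a <= 1 per item")
        click_probs.append(reach * c)
        abandon_probs.append(reach * a)
        reach *= 1.0 - c - a  # user continues to the next position
    return click_probs, abandon_probs, reach


# Toy two-item slate with made-up probabilities.
clicks, abandons, leftover = cascade_outcomes([0.5, 0.2], [0.1, 0.3])
total = sum(clicks) + sum(abandons) + leftover  # outcomes are exhaustive
```

Because the outcomes are mutually exclusive and exhaustive, the click, abandon, and reach-the-end probabilities sum to one for any ordering. The ordering only changes how that mass is distributed across items, which is what a ranking policy can exploit.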

Abstract

We present the Learned Ranking Function (LRF), a system that takes short-term user-item behavior predictions as input and outputs a slate of recommendations that directly optimizes for long-term user satisfaction. Most previous work is based on optimizing the hyperparameters of a heuristic function. We propose to model the problem directly as a slate optimization problem with the objective of maximizing long-term user satisfaction. We also develop a novel constraint optimization algorithm that stabilizes objective trade-offs for multi-objective optimization. We evaluate our approach with live experiments and describe its deployment on YouTube.

Paper Structure (28 sections, 1 theorem, 16 equations, 3 figures, 2 algorithms)

This paper contains 28 sections, 1 theorem, 16 equations, 3 figures, 2 algorithms.

Introduction and Related Work
Problem Formation
MDP Formulation
Lift Formulation with Cascade Click model
Optimization Algorithm
Single Objective Optimization
Training
Training the abandon reward network
Training the lift reward network
Training the click network
Inference
Constraint optimization
Offline evaluation on exploration candidates
Optimization with correlation constraint
Deployment and Evaluation
...and 13 more sections

Key Result

Theorem 2.1

Given user-item functions $p_{clk},p_{abd},R_{abd}^{\pi},R_{lift}^{\pi}$ as input, the optimal ranking for user $u$ on candidate $V$ maximizing $Q^\pi((u,V),\sigma)$ for a scalar reward function is to order all items $v\in V$ by $\frac{p_{clk}(u,v)}{p_{clk}(u,v)+p_{abd}(u,v)} \cdot R_{lift}^{\pi}(u,

Figures (3)

Figure 1: Markov Reward Process with Cascade Click Model
Figure 2: LRF deployment diagram
Figure 3: Metrics for experiments in Section \ref{sec:launch} (top left), \ref{sec:cascade} (top right), \ref{sec:uplift} (bottom left), and \ref{sec:two} (bottom right).

Theorems & Definitions (5)

Definition 1
Definition 2
Theorem 2.1
Proof
Definition 3
