Learned Ranking Function: From Short-term Behavior Predictions to Long-term User Satisfaction
Yi Wu, Daryl Chang, Jennifer She, Zhe Zhao, Li Wei, Lukasz Heldt
TL;DR
The paper tackles the problem of optimizing long-term user satisfaction in slate-based recommendations by reframing ranking as slate optimization under a multi-objective MDP. It introduces the Learned Ranking Function (LRF), which uses a cascade click model and lift-based rewards to account for abandonment, and it employs a constrained optimization approach via dynamic linear scalarization to stabilize trade-offs across objectives. A practical on-policy Monte Carlo optimization framework trains separate user-item networks for abandonment, click, and lift signals, enabling inference that maximizes a learned $Q$-function. The approach is deployed on YouTube and validated through multiple live experiments, demonstrating improvements in long-term satisfaction and showcasing the value of lift formulations, cascade modeling, and offline-evaluation-guided weight adaptation for multi-objective slate optimization.
Abstract
We present the Learned Ranking Function (LRF), a system that takes short-term user-item behavior predictions as input and outputs a slate of recommendations that directly optimizes for long-term user satisfaction. Most previous work is based on optimizing the hyperparameters of a heuristic function. We propose to model the problem directly as a slate optimization problem with the objective of maximizing long-term user satisfaction. We also develop a novel constrained optimization algorithm that stabilizes objective trade-offs for multi-objective optimization. We evaluate our approach with live experiments and describe its deployment on YouTube.
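To make the slate-level view concrete, the following is a minimal toy sketch, not the deployed LRF. It assumes a simplified cascade model in which the user scans the slate top to bottom and, at each position, either clicks (ending the scan, a simplification made here), abandons, or moves on. Multiple objectives are combined by a fixed linear scalarization. The function names, the per-item probabilities, and the weight vector are all hypothetical; in the paper the corresponding quantities come from learned user-item networks and the weights are adapted dynamically.

```python
from typing import Sequence


def scalarize(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-objective rewards into one scalar via a weighted sum."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def cascade_slate_value(
    p_click: Sequence[float],
    p_abandon: Sequence[float],
    rewards: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Expected scalarized reward of an ordered slate under a toy cascade model.

    Assumption (illustrative only): at each position the user clicks with
    probability p_click[i], abandons with probability p_abandon[i], and
    otherwise continues to the next position. A click ends the scan.
    """
    reach = 1.0  # probability that the user reaches the current position
    value = 0.0
    for pc, pa, r in zip(p_click, p_abandon, rewards):
        if pc < 0 or pa < 0 or pc + pa > 1:
            raise ValueError("click and abandon probabilities must be valid")
        value += reach * pc * scalarize(r, weights)
        reach *= 1.0 - pc - pa  # user neither clicked nor abandoned
    return value


if __name__ == "__main__":
    # Two items, two objectives (hypothetical numbers).
    v = cascade_slate_value(
        p_click=[0.5, 0.5],
        p_abandon=[0.25, 0.25],
        rewards=[[1.0, 0.0], [0.0, 2.0]],
        weights=[1.0, 0.5],
    )
    print(v)
```

Because the probability of reaching a position shrinks with every earlier click or abandonment, the value depends on the whole ordering rather than on each item in isolation, which is the motivation for treating ranking as slate optimization.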
