Multi-Objective Recommendation via Multivariate Policy Learning

Olivier Jeunen, Jatin Mandav, Ivan Potapov, Nakul Agarwal, Sourabh Vaid, Wenzhe Shi, Aleksei Ustimenko

TL;DR

This work extends existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield, and provides guidance to design stochastic data collection policies, as well as highly sensitive reward signals.

Abstract

Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users. These include behavioural signals (e.g. clicks, shares, dwell time), as well as broader objectives (e.g. diversity, fairness). Scalarisation methods are commonly used to handle this balancing task, where a weighted average of per-objective reward signals determines the final score used for ranking. Naturally, exactly how these weights are computed is key to success for any online platform. We frame this as a decision-making task, where the scalarisation weights are actions taken to maximise an overall North Star reward (e.g. long-term user retention or growth). We extend existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield. Typical lower bounds based on normal approximations suffer from insufficient coverage, and we propose an efficient and effective policy-dependent correction for this. We provide guidance to design stochastic data collection policies, as well as highly sensitive reward signals. Empirical observations from simulations, offline experiments and online experiments highlight the efficacy of our deployed approach.
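
To make the scalarisation step concrete, the following is a minimal sketch of ranking by a weighted average of per-objective scores. The function name `scalarise`, the three objectives and all numbers are hypothetical and chosen only for illustration; in the framing of the abstract, the weight vector is the action that a learnt policy would output.

```python
def scalarise(objective_scores, weights):
    """Weighted average of per-objective scores, one value per item.

    objective_scores: list of rows, one row of per-objective scores per item.
    weights: one non-negative weight per objective (the policy's "action").
    """
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, row)) / total
        for row in objective_scores
    ]


# Hypothetical scores for three items on (click, share, dwell-time) objectives.
scores = [
    [0.9, 0.1, 0.2],
    [0.4, 0.6, 0.5],
    [0.2, 0.2, 0.9],
]
weights = [0.5, 0.3, 0.2]  # assumed values, not taken from the paper

final = scalarise(scores, weights)
# Item indices sorted by scalarised score, best first.
ranking = sorted(range(len(final)), key=lambda i: -final[i])
```

Changing `weights` changes the ranking, which is why choosing them can be treated as a continuous, multivariate decision.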

Paper Structure

This paper contains 18 sections, 18 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Visualising the curse of dimensionality and counterintuitive properties of the Uniform distribution in high dimensions. Left: Whilst all distances are equally likely in a single dimension, this property quickly disappears in higher dimensions. Middle: The Normal distribution exhibits similar but less pronounced behaviour, with a lower slope on the cumulative density. Right: Uniform sampling disfavours a hypersphere around the mean. Normal sampling can help to partially alleviate this.
  • Figure 2: Off-policy evaluation results for $\widehat{V}_{\rm SNIPS}$, considering the coverage of C.I.s obtained through several variants of our proposed ESS correction. ESS-corrected methods attain target coverage at significantly reduced sample sizes.
  • Figure 3: All considered C.I. corrections significantly reduce the required sample size for the C.I. to achieve its specified coverage level: down to 60$\times$ on average for $D_{N}^{R-\infty}$.
  • Figure 4: Off-policy evaluation results for varying $\sigma$, visualising the bias-variance trade-off that comes with the kernel smoothing technique. We provide results for an insensitive North Star (a), and a learnt reward signal that maximises statistical power (b).
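
As background for the quantities named in the captions above, the following is a minimal sketch of a self-normalised importance sampling (SNIPS) estimate together with a normal-approximation lower bound and the standard Kish effective sample size (ESS). It is a generic textbook construction under stated assumptions, not the paper's policy-dependent correction, which is not specified in this section. The function names, the univariate Gaussian policies and the synthetic reward are all hypothetical, and a single action dimension is used for brevity even though the paper considers multivariate actions.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Normal(mean, std**2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def snips_lower_bound(actions, rewards, target_pdf, logging_pdf, z=1.645):
    """Return (SNIPS estimate, normal-approximation lower bound, Kish ESS)."""
    # Importance weights: target density over logging density.
    w = [target_pdf(a) / logging_pdf(a) for a in actions]
    sum_w = sum(w)
    v_hat = sum(wi * ri for wi, ri in zip(w, rewards)) / sum_w
    # Delta-method variance estimate for the self-normalised estimator.
    var_hat = sum((wi * (ri - v_hat)) ** 2 for wi, ri in zip(w, rewards)) / sum_w ** 2
    # Kish effective sample size: n when all weights are equal, smaller otherwise.
    ess = sum_w ** 2 / sum(wi * wi for wi in w)
    return v_hat, v_hat - z * math.sqrt(var_hat), ess


# Synthetic logged data (assumed setup): actions drawn from Normal(0, 1).
rng = random.Random(0)
actions = [rng.gauss(0.0, 1.0) for _ in range(2000)]
rewards = [-(a - 0.5) ** 2 + rng.gauss(0.0, 0.1) for a in actions]

v_hat, lower, ess = snips_lower_bound(
    actions,
    rewards,
    target_pdf=lambda a: normal_pdf(a, 0.5, 0.8),   # hypothetical target policy
    logging_pdf=lambda a: normal_pdf(a, 0.0, 1.0),  # hypothetical logging policy
)
```

A low `ess` relative to the number of logged samples indicates that few samples dominate the estimate, which is the regime in which a plain normal-approximation interval can be expected to under-cover.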