Fast Slate Policy Optimization: Going Beyond Plackett-Luce

Otmane Sakhi; David Rohde; Nicolas Chopin

TL;DR

The paper tackles offline slate policy learning in environments with a massive action space of size $P$, where optimal slate selection is intractable. It critiques Plackett-Luce policies for their high gradient variance and computational burden, and introduces Latent Gaussian Perturbation (LGP), a stochastic slate policy that perturbs latent scores $h_\theta(x)$ with Gaussian noise and uses approximate maximum inner product search (MIPS) to achieve $\mathcal{O}(\log P)$ sampling. A key contribution is fixing the action embeddings $\beta$, which dramatically reduces the parameter count and gradient variance while enabling an unbiased, low-variance gradient estimator whose complexity does not scale with the slate size $K$. Empirical results on MovieLens 25M, Twitch, and Goodreads show that LGP, especially with MIPS acceleration (LGP-MIPS), consistently outperforms Plackett-Luce baselines under the same time budget, and that it remains robust across different $S$ and $K$ settings and scales to action spaces with millions of items. The work thus offers a practical, scalable alternative for large-scale slate policy optimization with arbitrary reward structures.
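As a rough illustration of the sampling scheme summarised above, the sketch below draws one slate by adding Gaussian noise to a latent vector and keeping the $K$ largest inner products with fixed action embeddings. It is not the authors' implementation: the function names, the noise scale `sigma`, and the exact `argpartition` search (standing in for an approximate MIPS index) are assumptions made for illustration, as is the choice to add the noise in the latent space.

```python
import numpy as np


def lgp_sample_slate(h, beta, K, sigma, rng):
    """Draw one slate by perturbing the latent vector h (hypothetical helper).

    h     : (L,)   latent vector, playing the role of h_theta(x)
    beta  : (P, L) fixed action embeddings
    K     : slate size (K <= P)
    sigma : standard deviation of the Gaussian perturbation (assumed)
    """
    eps = rng.normal(0.0, sigma, size=h.shape)   # Gaussian noise in latent space
    scores = beta @ (h + eps)                    # inner products with all P actions
    # Exact top-K search; an approximate MIPS index would replace this step.
    top = np.argpartition(-scores, K - 1)[:K]
    slate = top[np.argsort(-scores[top])]        # order the K items by score
    return slate, eps


def lgp_grad_wrt_h(reward, eps, sigma):
    """Score-function gradient term with respect to the unperturbed latent vector.

    For h_tilde ~ N(h, sigma^2 I), grad_h log N(h_tilde; h, sigma^2 I)
    equals (h_tilde - h) / sigma^2 = eps / sigma^2.
    """
    return reward * eps / sigma**2


rng = np.random.default_rng(0)
beta = rng.normal(size=(1000, 16))   # P = 1000 actions, L = 16 latent dimensions
h = rng.normal(size=16)
slate, eps = lgp_sample_slate(h, beta, K=5, sigma=0.5, rng=rng)
grad = lgp_grad_wrt_h(reward=1.0, eps=eps, sigma=0.5)
```

In this sketch the slate is a deterministic function of the perturbed latent vector, so the gradient term only involves the Gaussian density over the latent space. Its cost depends on the latent dimension, not on $K$ or $P$, which is one way to read the claim that the estimator's complexity does not scale with slate size.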

Abstract

An increasingly important building block of large-scale machine learning systems is based on returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries complete quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
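For readers unfamiliar with the Plackett-Luce baseline, the following minimal sketch shows the standard textbook form of that policy class: items are drawn sequentially without replacement with probability proportional to the exponential of their score. The helper names are hypothetical, the Gumbel top-$K$ trick is used as a well-known equivalent way to sample, and nothing here is taken from the paper's code.

```python
import numpy as np


def pl_sample_slate(scores, K, rng):
    """Sample a slate of K distinct items from a Plackett-Luce distribution.

    Adding i.i.d. Gumbel noise to the scores and keeping the top K is
    equivalent to sequential softmax sampling without replacement.
    """
    perturbed = scores + rng.gumbel(size=scores.shape)
    return np.argsort(-perturbed)[:K]


def pl_log_prob(scores, slate):
    """Log-probability of an ordered slate under Plackett-Luce.

    Each position contributes its score minus a log-sum-exp over the items
    still available, so the cost grows with both K and the number of items P.
    """
    remaining = np.ones(scores.shape[0], dtype=bool)
    logp = 0.0
    for a in slate:
        s = scores[remaining]
        m = s.max()                                   # stabilise the log-sum-exp
        logp += scores[a] - (m + np.log(np.exp(s - m).sum()))
        remaining[a] = False                          # item a cannot be drawn again
    return logp


rng = np.random.default_rng(0)
scores = rng.normal(size=1000)       # one score per action, P = 1000
slate = pl_sample_slate(scores, K=5, rng=rng)
logp = pl_log_prob(scores, slate)
```

In this form, evaluating the log-probability requires a normalisation over the remaining items at each of the $K$ positions, which gives some intuition for why the cost of this policy class becomes a concern when the action space is large.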

Paper Structure (26 sections, 49 equations, 5 figures, 4 tables, 2 algorithms)

Introduction
Plackett-Luce Policies
A Simple Definition
Optimizing The Objective
Computational Burden.
Variance Problems.
Fixing The Action Embeddings
We can still learn more about the actions.
Latent Gaussian Perturbation
Experiments
Experimental Setting
Performance under the same time budget
Performance under the same number of iterations
Related work
Learning from interactions.
...and 11 more sections

Figures (5)

Figure 1: The performance of slate decision functions obtained by optimizing them with different training algorithms for the same time budget of 60 minutes. Performance on the validation data is evaluated after every 6 minutes of training.
Figure 2: Comparing the performance of slate decision functions obtained by running our different training algorithms for the same number of iterations. The newly proposed LGP relaxations have gradient estimates of quality comparable to PL-Rank, as they result in policies with similar training behaviour. LGP methods enjoy low-variance gradient estimates without making any assumption on the structure of the reward, contrary to PL-Rank.
Figure 3: Experiments on the MovieLens dataset. We look at the effect of fixing the action embeddings $\beta$ on the training of Plackett-Luce policies. Training $\beta$ results in a slower optimization procedure and gradient estimates with higher variance.
Figure 4: Impact of the slate size $K$ on the log variance of the gradient estimate of both PL-PG and LGP on MovieLens. Contrary to LGP, PL-PG has a gradient estimate with a variance that grows with $K$.
Figure 5: The performance of slate decision functions, with Neural Network Backbones, obtained by our different training algorithms after running the optimization for 60 minutes.
