Fast Slate Policy Optimization: Going Beyond Plackett-Luce
Otmane Sakhi, David Rohde, Nicolas Chopin
TL;DR
The paper tackles offline slate policy learning in environments with a massive action space of size $P$, where exact optimal slate selection is intractable. It critiques the Plackett-Luce policy class for its high gradient variance and computational burden, and introduces Latent Gaussian Perturbation (LGP), a stochastic slate policy that perturbs latent scores $h_\theta(x)$ with Gaussian noise and uses approximate maximum inner product search (MIPS) to achieve $\mathcal{O}(\log P)$ sampling. A key contribution is fixing the action embeddings $\beta$, which dramatically reduces the parameter count and gradient variance while enabling an unbiased, low-variance gradient estimator whose complexity does not scale with the slate size $K$. Empirical results on MovieLens 25M, Twitch, and Goodreads show that LGP, especially with MIPS acceleration (LGP-MIPS), consistently outperforms Plackett-Luce baselines under the same time budget, is robust across different $S$ and $K$ settings, and scales to billion-scale action spaces. The work thus offers a practical, scalable alternative for large-scale slate policy optimization with arbitrary reward structures.
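To make the sampling step described above concrete, here is a minimal NumPy sketch of drawing one slate from an LGP-style policy. It rests on several assumptions that are not stated in this summary: the Gaussian noise is added to the latent vector with a scale named `sigma`, items are scored by inner product with the fixed embeddings `beta`, and a brute-force top-$K$ stands in for the approximate MIPS index (so this version costs $\mathcal{O}(P)$ per sample, not $\mathcal{O}(\log P)$). The function name `sample_lgp_slate` is hypothetical.

```python
import numpy as np

def sample_lgp_slate(h, beta, K, sigma, rng):
    """Sample one ordered slate of K items (illustrative sketch only).

    h     : (L,) latent vector, playing the role of h_theta(x)
    beta  : (P, L) fixed action embeddings
    sigma : noise scale (assumed hyperparameter, name is hypothetical)
    """
    eps = rng.standard_normal(h.shape)      # Gaussian noise in latent space
    scores = beta @ (h + sigma * eps)       # inner-product score for every action
    # Exact top-K by brute force; a real system would query an
    # approximate MIPS index here instead of scoring all P actions.
    top = np.argpartition(-scores, K - 1)[:K]
    return top[np.argsort(-scores[top])]    # order the K items by descending score

rng = np.random.default_rng(0)
P, L, K = 1000, 16, 5
beta = rng.standard_normal((P, L))
h = rng.standard_normal(L)

slate = sample_lgp_slate(h, beta, K, sigma=0.5, rng=rng)   # stochastic slate
greedy = sample_lgp_slate(h, beta, K, sigma=0.0, rng=rng)  # no noise: plain top-K
```

With `sigma=0.0` the sketch reduces to the deterministic top-$K$ retrieval, which illustrates why the perturbed policy can reuse the same retrieval structure at sampling time.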
Abstract
An increasingly important building block of large-scale machine learning systems is the ability to return slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries complete quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
