Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization
Zhou Qin, Kai Yuan, Pratik Lahiri, Wenyang Liu
TL;DR
This work reframes content ranking optimization (CRO) on multi-slot pages as a cooperative multi-agent reinforcement learning problem and solves it with MADDPG, enabling joint, whole-page optimization with an emphasis on long-term rewards. The approach pairs a centralized critic with decentralized actors to address non-stationarity, and defines a composite reward that combines revenue, profit, long-term value, clicks, and abandonment signals. Empirically, MADDPG scales to large action spaces, outperforms deep bandits by about 25.7% on IPS on offline CRO data, yields incremental revenue in online A/B tests, and performs strongly on MuJoCo benchmarks. The findings suggest that joint page-level optimization via multi-agent RL can significantly improve user experience and business metrics in information retrieval tasks, and may apply to other joint optimization problems in IR.
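To make the offline comparison concrete, the following is a minimal sketch of an off-policy value estimate, assuming "IPS" here denotes inverse propensity scoring in its standard unclipped form. The summary does not state which IPS variant the authors use, and the names `ips_estimate`, `target_prob`, and the record layout are hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged impression: (context, action shown, observed reward,
# probability that the logging policy chose that action).
# This record layout is an assumption for illustration.
LoggedRecord = Tuple[dict, int, float, float]


def ips_estimate(
    logs: Sequence[LoggedRecord],
    target_prob: Callable[[dict, int], float],
) -> float:
    """Estimate the average reward of a new policy from logged data.

    Each logged reward is reweighted by how much more (or less) likely
    the new policy is to take the logged action than the logging policy.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        weight = target_prob(context, action) / propensity
        total += weight * reward
    return total / len(logs)


# Toy data: the logging policy picked each of two actions with probability 0.5.
logs = [
    ({}, 0, 0.0, 0.5),
    ({}, 1, 1.0, 0.5),
    ({}, 1, 1.0, 0.5),
    ({}, 0, 0.0, 0.5),
]


def always_action_one(context: dict, action: int) -> float:
    """A candidate policy that always selects action 1."""
    return 1.0 if action == 1 else 0.0


estimate = ips_estimate(logs, always_action_one)
```

In this toy log, action 1 always earns reward 1.0, so a policy that always picks it is estimated at 1.0 even though the logged data was collected under a different, uniform policy.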
Abstract
In a typical e-commerce setting, Content Ranking Optimization (CRO) mechanisms are employed to surface content on the search page to fulfill customers' shopping missions. CRO commonly uses models such as contextual deep bandits to rank content independently at different positions, e.g., one optimizer dedicated to organic search results and another to sponsored results. However, this regional optimization approach does not necessarily translate to whole-page optimization, e.g., maximizing revenue at the top of the page may inadvertently diminish the revenue of lower positions. In this paper, we propose a reinforcement learning based method for whole-page ranking that jointly optimizes across all positions by: 1) shifting from position-level optimization to whole-page-level optimization to achieve an overall optimized ranking; 2) applying reinforcement learning to optimize for cumulative rewards instead of the immediate reward. We formulate page-level CRO as a cooperative multi-agent Markov Decision Process and address it with the novel Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model. MADDPG supports a flexible and scalable joint optimization framework by adopting a "centralized training and decentralized execution" approach. Extensive experiments demonstrate that MADDPG scales to an action space of 2.5 billion in the public MuJoCo environment, and outperforms deep bandit modeling by 25.7% on the offline CRO data set from a leading e-commerce company. We foresee that this novel multi-agent optimization is applicable to similar joint optimization problems in the field of information retrieval.
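The "centralized training and decentralized execution" structure can be sketched as follows. This is only a structural illustration, not the authors' implementation: it uses untrained linear functions in place of deep networks, omits the gradient updates, and every name and dimension (`Actor`, `CentralizedCritic`, three agents, four observation features) is a hypothetical choice.

```python
import math
import random
from typing import List


class Actor:
    """Decentralized actor: sees only its own position's observation."""

    def __init__(self, obs_dim: int, rng: random.Random) -> None:
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]

    def act(self, obs: List[float]) -> float:
        # Deterministic policy; tanh keeps the continuous action in (-1, 1).
        return math.tanh(sum(w * x for w, x in zip(self.weights, obs)))


class CentralizedCritic:
    """Training-time critic: scores the joint observation and joint action."""

    def __init__(self, n_agents: int, obs_dim: int, rng: random.Random) -> None:
        joint_dim = n_agents * (obs_dim + 1)  # all observations plus one action each
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(joint_dim)]

    def q_value(self, all_obs: List[List[float]], all_actions: List[float]) -> float:
        joint = [x for obs in all_obs for x in obs] + list(all_actions)
        return sum(w * x for w, x in zip(self.weights, joint))


rng = random.Random(0)
n_agents, obs_dim = 3, 4  # e.g. three page positions, four features each (assumed)

actors = [Actor(obs_dim, rng) for _ in range(n_agents)]
critic = CentralizedCritic(n_agents, obs_dim, rng)

observations = [[rng.random() for _ in range(obs_dim)] for _ in range(n_agents)]

# Execution: each actor acts on its own observation only.
actions = [actor.act(obs) for actor, obs in zip(actors, observations)]

# Training: the critic evaluates everything jointly, so each actor's
# learning signal accounts for what the other positions are doing.
q = critic.q_value(observations, actions)
```

Because the critic conditions on every agent's action, the environment looks stationary from its point of view even while the other actors' policies change, which is the motivation for centralizing it during training only.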
