Supervised Learning-enhanced Multi-Group Actor Critic for Live Stream Allocation in Feed
Jingxin Liu, Xiang Gao, Yisha Li, Xin Li, Haiyang Lu, Ben Wang
TL;DR
This work tackles the problem of allocating live streams in a mixed short-video and live-stream recommender system under platform-level constraints, with the goal of maximizing long-term engagement. It introduces SL-MGAC, a supervised learning-enhanced multi-group actor-critic framework that uses multi-group state decomposition (MG-SD) to reduce variance, distribution discretization with multi-task reward learning to stabilize critic updates, and layer normalization to mitigate RL divergence. The approach combines a shared feature extractor, a multi-group actor-critic core, and teacher-student distillation to yield a robust online policy that performs well in offline NCIS evaluation and online A/B tests while keeping online inference fast. Empirical results show higher cumulative reward and markedly better stability than the baselines. The paper also reports practical deployment on a platform-scale Kwai RS and offers guidance on parameter choices and stability mechanisms for industrial RL systems.
Abstract
In a mixed short video and live stream recommendation scenario, the live stream recommendation system (RS) decides whether to allocate at most one live stream into the video feed for each user request. To maximize long-term user engagement, it is crucial to determine an optimal policy for accurate live stream allocation. An inappropriate allocation policy that ignores the long-term negative impact of live stream allocation can significantly hurt app usage duration and user retention. Recently, reinforcement learning (RL) has been widely applied in recommendation systems to capture long-term user engagement. However, traditional RL algorithms often suffer from divergence and instability, which restricts their application and deployment in large-scale industrial recommendation systems, especially in the challenging scenario described above. To address these challenges, we propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). Specifically, we introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques, in which multi-task reward learning helps restrict the accumulation of bootstrapping error during critic learning. Additionally, we design a multi-group state decomposition module for both the actor and critic networks to reduce prediction variance and improve model stability. We also propose a novel reward function to prevent overly greedy live stream allocation. Empirically, we evaluate the SL-MGAC algorithm using offline policy evaluation (OPE) and online A/B testing. Experimental results demonstrate that the proposed method not only outperforms baseline methods under the platform-level constraints but also exhibits enhanced stability in online recommendation scenarios.
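To make the offline policy evaluation step more concrete, the sketch below shows one common OPE estimator, normalized capped importance sampling (NCIS), which is assumed here to be what the TL;DR's "NCIS evaluation" refers to. The function name `ncis_estimate`, its parameters, the cap value, and the toy data are illustrative assumptions and are not taken from the paper. The estimator reweights logged rewards by the ratio of target-policy to behavior-policy action probabilities, caps each ratio to limit variance, and normalizes by the sum of the capped weights.

```python
def ncis_estimate(rewards, target_probs, behavior_probs, cap=10.0):
    """Normalized capped importance sampling estimate of a target policy's value.

    rewards[i]        : reward logged for sample i
    target_probs[i]   : probability the evaluated policy assigns to the logged action
    behavior_probs[i] : probability the logging policy assigned to the logged action
    cap               : upper bound on each importance weight (hypothetical default)
    """
    if not (len(rewards) == len(target_probs) == len(behavior_probs)):
        raise ValueError("all inputs must have the same length")

    weights = []
    for pi_target, pi_behavior in zip(target_probs, behavior_probs):
        if pi_behavior <= 0.0:
            raise ValueError("behavior probabilities must be positive")
        # Cap the importance ratio so a few rare actions cannot dominate.
        weights.append(min(pi_target / pi_behavior, cap))

    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all importance weights are zero")

    # Self-normalize: divide by the sum of capped weights, not the sample count.
    return sum(w * r for w, r in zip(weights, rewards)) / total_weight


# Toy logged data (made up for illustration only).
logged_rewards = [1.0, 0.0, 2.0]
target = [0.5, 0.2, 0.9]
behavior = [0.5, 0.4, 0.1]
print(round(ncis_estimate(logged_rewards, target, behavior, cap=2.0), 3))  # 1.429
```

In the toy call, the third sample's raw ratio of about 9 is capped at 2, so the capped weights are 1, 0.5, and 2, and the estimate is 5 / 3.5. A smaller cap lowers variance at the cost of more bias, so the cap is a tuning choice rather than a fixed constant.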
