Supervised Learning-enhanced Multi-Group Actor Critic for Live Stream Allocation in Feed

Jingxin Liu, Xiang Gao, Yisha Li, Xin Li, Haiyang Lu, Ben Wang

TL;DR

This work tackles the problem of allocating live streams in a mixed short video and live stream recommendation system (RS) under platform-level constraints, with the goal of maximizing long-term engagement. It introduces SL-MGAC, a supervised learning-enhanced multi-group actor-critic framework that uses multi-group state decomposition (MG-SD) to reduce variance, distribution discretization with multi-task reward learning to stabilize critic updates, and layer normalization to mitigate RL divergence. The approach combines a shared feature extractor, a multi-group actor-critic core, and teacher-student distillation to yield a robust online policy that performs well in offline NCIS evaluation and online A/B tests while keeping online inference fast. Empirical results show higher cumulative reward and markedly improved stability compared with baselines, along with practical deployment in a platform-scale RS at Kwai and guidance on parameter choices and stability mechanisms for industrial RL systems.
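The summary above does not spell out how the distribution discretization is done. One common reading, sketched here purely as an illustration, is to bucket a continuous reward into a few classes so that reward prediction becomes a classification task, and to recover a scalar estimate as the probability-weighted sum of bucket centers. The bucket edges, bucket centers, and the function names `to_bucket` and `expected_reward` are all assumptions made for this sketch, not details taken from the paper.

```python
from bisect import bisect_right
from typing import Sequence


def to_bucket(reward: float, edges: Sequence[float]) -> int:
    """Map a continuous reward to a bucket index.

    `edges` must be sorted ascending; len(edges) + 1 buckets result.
    A reward equal to an edge falls into the bucket to its right.
    """
    return bisect_right(edges, reward)


def expected_reward(probs: Sequence[float], centers: Sequence[float]) -> float:
    """Turn predicted bucket probabilities back into a scalar estimate."""
    if len(probs) != len(centers):
        raise ValueError("probs and centers must have the same length")
    return sum(p * c for p, c in zip(probs, centers))


# Hypothetical edges (e.g. seconds of watch time) and bucket centers.
EDGES = [5.0, 30.0, 120.0]
CENTERS = [2.5, 17.5, 75.0, 180.0]

labels = [to_bucket(r, EDGES) for r in (0.0, 5.0, 45.0, 500.0)]
estimate = expected_reward([0.1, 0.2, 0.6, 0.1], CENTERS)
```

Under this reading, the estimate is bounded by the smallest and largest bucket centers, so a single outlier sample cannot push a regression target arbitrarily far. That is one plausible reason such a scheme would help stabilize critic updates.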

Abstract

In a mixed short video and live stream recommendation scenario, the live stream recommendation system (RS) decides whether to insert at most one live stream into the video feed for each user request. To maximize long-term user engagement, it is crucial to determine an optimal policy for accurate live stream allocation. An inappropriate allocation policy, one that ignores the long-term negative impact of live stream allocation, can significantly affect app usage duration and user retention. Recently, reinforcement learning (RL) has been widely applied in recommendation systems to capture long-term user engagement. However, traditional RL algorithms often face divergence and instability problems, which restrict their application and deployment in large-scale industrial recommendation systems, especially in the challenging scenario described above. To address these challenges, we propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). Specifically, we introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques, in which multi-task reward learning helps restrict the accumulation of bootstrapping error during critic learning. Additionally, we design a multi-group state decomposition module for both the actor and critic networks to reduce prediction variance and improve model stability. We also propose a novel reward function to prevent overly greedy live stream allocation. Empirically, we evaluate the SL-MGAC algorithm using offline policy evaluation (OPE) and online A/B testing. Experimental results demonstrate that the proposed method not only outperforms baseline methods under the platform-level constraints but also exhibits enhanced stability in online recommendation scenarios.
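The TL;DR names NCIS as the offline evaluation metric. Assuming this stands for normalized capped importance sampling, a standard off-policy estimator, the sketch below shows the general idea. Each logged reward is weighted by the ratio of target-policy to behavior-policy action probability, the ratio is capped to limit variance, and the result is normalized by the sum of the capped weights. The function name `ncis_estimate`, the default cap of 10.0, and the per-sample (rather than per-trajectory) weighting are assumptions made for this illustration; the paper's exact formulation may differ.

```python
from typing import Sequence


def ncis_estimate(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    behavior_probs: Sequence[float],
    cap: float = 10.0,  # hypothetical cap value, chosen for illustration
) -> float:
    """Normalized capped importance sampling estimate of policy value.

    rewards[i]        : logged reward of sample i
    target_probs[i]   : probability the evaluated policy assigns to the logged action
    behavior_probs[i] : probability the logging policy assigned to the logged action
    """
    if not (len(rewards) == len(target_probs) == len(behavior_probs)):
        raise ValueError("all inputs must have the same length")

    weights = []
    for p_target, p_behavior in zip(target_probs, behavior_probs):
        if p_behavior <= 0.0:
            raise ValueError("behavior probabilities must be positive")
        # Cap the importance ratio so rare logged actions cannot dominate.
        weights.append(min(cap, p_target / p_behavior))

    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    # Self-normalize by the sum of capped weights.
    return sum(w * r for w, r in zip(weights, rewards)) / total_weight


on_policy = ncis_estimate([1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
capped = ncis_estimate([1.0, 3.0], [0.9, 0.1], [0.1, 0.5], cap=2.0)
```

When the evaluated policy matches the logging policy, every weight is 1 and the estimate reduces to the plain mean of the logged rewards, which is a quick sanity check for any implementation.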

Paper Structure

This paper contains 29 sections, 16 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Structure of a short video & live stream mixed recommendation system (RS). The decision making of SL-MGAC takes place in the final stage of the live stream RS.
  • Figure 2: Overall framework of the SL-MGAC algorithm. The SL (RL) MG-SD Module is short for the Multi-Group State Decomposition Module for supervised reward learning and critic learning.
  • Figure 3: System Architecture of the SL-MGAC algorithm.
  • Figure 4: Q-value curves of SL-MGAC and SL-MGAC (w/o MG) over 10 rounds of training. The lines show the mean Q-value and the shaded areas show the standard deviation (std).
  • Figure 5: Performance for different numbers of user groups $K$.
  • ...and 3 more figures