UNEX-RL: Reinforcing Long-Term Rewards in Multi-Stage Recommender Systems with UNidirectional EXecution

Gengrui Zhang; Yao Wang; Xiaoshuang Chen; Hongyi Qian; Kaiqiao Zhan; Ben Wang

TL;DR

UNEX-RL tackles long-term reward optimization in industrial multi-stage recommender systems by introducing unidirectional execution and training based on a cascading information chain (CIC) to manage observation dependencies across stages. It optimizes the objective $R_t = \sum_{t'=t}^{\infty}\gamma^{t'-1} r_{t'}$ using the CIC, which reconstructs downstream observations, and a global critic that stabilizes learning, together with two variance-reduction techniques: Stopping Gradient (SG) and Category Quantile Rescale (CQR). The two major contributions are (i) CIC, which addresses the observation dependency and the cascading effect, and (ii) variance reduction techniques that improve training stability. The method shows improvements on public data and in an online deployment with over $10^8$ users, including a daily WatchTime gain of $0.953\%$ over a CEM baseline and $0.558\%$ over TD3. The work provides a practical multi-agent reinforcement learning (MARL) framework for production recommender systems and highlights the importance of addressing the observation dependency and the cascading effect in multi-stage pipelines.
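To make the objective concrete, here is a minimal sketch that evaluates $R_t$ exactly as written above, with 1-indexed time steps. The function name `discounted_return` and the truncation of the infinite sum at the length of a finite reward list are assumptions made for illustration; they are not taken from the paper.

```python
def discounted_return(rewards, gamma, t=1):
    """Evaluate R_t = sum_{t'=t}^{T} gamma**(t'-1) * r_{t'} with 1-indexed steps.

    rewards[0] holds r_1. The horizon T = len(rewards) truncates the
    infinite sum of the objective (an assumption of this sketch).
    """
    if t < 1:
        raise ValueError("time steps are 1-indexed, so t must be >= 1")
    return sum(
        gamma ** (tp - 1) * rewards[tp - 1]
        for tp in range(t, len(rewards) + 1)
    )


# Three steps with reward 1.0 each and gamma = 0.5.
full_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)        # 1 + 0.5 + 0.25
tail_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5, t=2)   # 0.5 + 0.25
```

A smaller `gamma` shrinks the contribution of later rewards, which is why long-term objectives such as WatchTime are typically optimized with `gamma` close to 1.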

Abstract

In recent years, there has been a growing interest in utilizing reinforcement learning (RL) to optimize long-term rewards in recommender systems. Since industrial recommender systems are typically designed as multi-stage systems, RL methods with a single agent face challenges when optimizing multiple stages simultaneously. The reason is that different stages have different observation spaces, and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce the long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems, bringing new challenges to the applications of multi-agent reinforcement learning (MARL), namely the observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method to separate the independent observations from action-dependent observations and use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting the effectiveness of UNEX-RL in industrial recommender systems.
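The observation dependency and the cascading effect can be illustrated with a toy cascade. The sketch below is not the paper's architecture: the function `run_cascade`, the use of a weight vector as each stage's action, and the top-k filtering rule are all hypothetical choices made only to show that a downstream stage's observation changes when an upstream action changes.

```python
import numpy as np


def run_cascade(item_features, actions, keep_sizes):
    """Run the stages in order (unidirectional execution).

    Each stage scores the surviving candidates with its action (a weight
    vector, an assumption of this sketch) and passes the top-k onward.
    Returns the candidate indices observed by each stage and the final output.
    """
    candidates = np.arange(item_features.shape[0])
    observations = []
    for action, k in zip(actions, keep_sizes):
        observations.append(candidates.copy())      # what this stage sees
        scores = item_features[candidates] @ action
        order = np.argsort(-scores, kind="stable")[:k]
        candidates = candidates[order]               # what the next stage sees
    return observations, candidates


# Four hypothetical items described by two features.
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.9]])
second_stage_action = np.array([0.0, 1.0])

# Same second-stage action, two different first-stage actions.
obs_a, out_a = run_cascade(items, [np.array([1.0, 0.0]), second_stage_action], [2, 1])
obs_b, out_b = run_cascade(items, [np.array([0.0, 1.0]), second_stage_action], [2, 1])
```

In this toy setting the first stage sees the same candidates in both runs, while the second stage sees items `[0, 2]` in one run and `[1, 3]` in the other, so its observation depends on the upstream action (observation dependency) and the final recommendation shifts accordingly (cascading effect). The second-stage observation is fully determined by the first-stage input and the upstream action, which is the kind of structure a CIC-style reconstruction can exploit.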

Paper Structure (28 sections, 1 theorem, 16 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Related Work
RL in Recommender Systems
MARL
Preliminary
Multi-Stage Recommender System
Problem Formulation
MARL
Method
Overall Framework
Training of UNEX-RL
Variance Reduction Techniques
Stopping Gradient (SG)
Category Quantile Rescale (CQR)
Evaluation
...and 13 more sections

Key Result

Theorem 4.1

Denote $\mathcal{E}(\cdot)$ as the information implied by input variables. Assume that the parameters $P^i$ of the information extraction in Eq. eq:v-t are given for all $i$. Then $\forall i, 2\leq i \leq N$, the set $\left\{\boldsymbol{\tau}^{1}_t, \boldsymbol{a}^{1:i-1}_t\right\}$ contains all the

Figures (8)

Figure 1: Long-term rewards in recommender systems.
Figure 2: A multi-stage recommender system.
Figure 3: Overall framework of UNEX-RL.
Figure 4: Train with CTDE and CIC.
Figure 5: Performance of different numbers of agents.
...and 3 more figures

Theorems & Definitions (2)

Theorem 4.1
proof
