
RL-MPCA: A Reinforcement Learning Based Multi-Phase Computation Allocation Approach for Recommender Systems

Jiahong Zhou, Shunhui Mao, Guoliang Yang, Bo Tang, Qianlong Xie, Lebin Lin, Xingxing Wang, Dong Wang

TL;DR

RL-MPCA addresses multi-phase computation resource (CR) allocation in recommender systems under resource constraints. It formulates the problem as a Weakly Coupled MDP and introduces a modular Q-network with a Constraint Layer. An adaptive-$\lambda$ mechanism during offline training and a $\lambda$-correction step during evaluation keep each phase within its budget while maximizing revenue. The approach is validated through offline simulation on a large Meituan dataset and through online A/B tests, where RL-MPCA outperforms baselines under CR constraints. The results indicate practical value for industrial systems: higher revenue with controlled resource usage and stable system behavior.

Abstract

Recommender systems aim to recommend the most suitable items to users from a large number of candidates. Their computation cost grows as the number of user requests and the complexity of services (or models) increase. Under limited computation resources (CRs), how to trade off computation cost against business revenue becomes an essential question. Existing studies focus on dynamically allocating CRs in queue truncation scenarios (i.e., allocating the size of the candidate set), and formulate the CR allocation problem as an optimization problem with constraints. Some of them focus on single-phase CR allocation, and others focus on multi-phase CR allocation but introduce assumptions specific to queue truncation scenarios. However, these assumptions do not hold in other scenarios, such as retrieval channel selection and prediction model selection. Moreover, existing studies ignore the state transition process of requests between different phases, which limits the effectiveness of their approaches. This paper proposes a Reinforcement Learning (RL) based Multi-Phase Computation Allocation approach (RL-MPCA), which aims to maximize the total business revenue under limited CRs. RL-MPCA formulates the CR allocation problem as a Weakly Coupled MDP problem and solves it with an RL-based approach. Specifically, RL-MPCA designs a novel deep Q-network to adapt to various CR allocation scenarios, and calibrates the Q-value by introducing multiple adaptive Lagrange multipliers (adaptive-$\lambda$) to avoid violating the global CR constraints. Finally, experiments in an offline simulation environment and on an online real-world recommender system validate the effectiveness of the approach.
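
The constrained formulation summarized in the abstract can be sketched as follows. The notation here is an assumption made for illustration and is not taken from the paper: $i$ indexes requests, $t$ indexes phases, $r$ and $c$ denote revenue and computation cost, and $C_t$ is the budget of phase $t$.

$$
\max_{\pi} \; \mathbb{E}\Big[\sum_{i}\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big]
\quad \text{s.t.} \quad
\mathbb{E}\Big[\sum_{i} c(s_{i,t}, a_{i,t})\Big] \le C_t, \quad t = 1, \dots, T.
$$

Relaxing each phase constraint with a multiplier $\lambda_t \ge 0$ gives, in one common form, a cost-penalized Q-value:

$$
Q_{\lambda}(s_t, a_t) = Q(s_t, a_t) - \lambda_t \, c(s_t, a_t).
$$

Under this assumed form, a larger $\lambda_t$ steers the greedy action in phase $t$ toward cheaper options, which is the role the abstract assigns to the adaptive Lagrange multipliers.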

Paper Structure (30 sections, 1 theorem, 18 equations, 7 figures, 4 tables, 3 algorithms)

This paper contains 30 sections, 1 theorem, 18 equations, 7 figures, 4 tables, 3 algorithms.

Key Result

Lemma 1

Suppose Assumptions 1 and 2 hold. For any $\lambda_t^k$, let $\lambda_t^{k+1}$ be: where $C_t$ is the computation budget of phase $t$ and $\alpha \in \mathbb{R}^{+}$ is the learning rate of $\lambda$. Then, the following conclusion holds:
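
The update rule itself is not reproduced in the text above. As a rough illustration only, the sketch below assumes a standard projected dual-ascent step, in which $\lambda_t$ rises when phase $t$ overspends its budget $C_t$ and falls otherwise. The functions `update_lambda` and `phase_cost_under`, the toy `requests` data, and the budget value are all hypothetical and are not the paper's algorithm.

```python
def update_lambda(lam, phase_cost, budget, lr):
    """One projected dual-ascent step for a single phase multiplier.

    Assumed rule (not taken from the paper): raise lambda when the phase
    overspends its budget, lower it when it underspends, and clip at zero.
    """
    return max(0.0, lam + lr * (phase_cost - budget))


def phase_cost_under(lam, requests):
    """Total cost when each request greedily maximizes value - lam * cost."""
    total = 0.0
    for options in requests:
        # options: list of (value, cost) pairs, one per candidate action
        _, cost = max(options, key=lambda vc: vc[0] - lam * vc[1])
        total += cost
    return total


# Hypothetical toy data: 3 requests, each with cheap / medium / expensive actions.
requests = [
    [(1.0, 1.0), (1.8, 2.0), (2.2, 4.0)],
    [(0.5, 1.0), (1.5, 2.0), (3.0, 4.0)],
    [(0.8, 1.0), (1.0, 2.0), (1.1, 4.0)],
]
budget = 7.0  # hypothetical phase budget C_t

lam = 0.0
for _ in range(200):
    cost = phase_cost_under(lam, requests)
    lam = update_lambda(lam, cost, budget, lr=0.05)

final_cost = phase_cost_under(lam, requests)
```

With $\lambda = 0$ every request in this toy data picks its most expensive action and the phase overspends. The multiplier then grows until the greedy choices fit the budget, at which point the update stops changing it.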

Figures (7)

  • Figure 1: The typical structure of recommender systems.
  • Figure 2: Request query procedure of recommender systems in a three-phase computation resource allocation situation.
  • Figure 3: Q-Network of RL-MPCA. It first models each phase using a separate network and calibrates the Q-value with the constraint layer. Then the selection unit selects the q-logits of a specific phase based on the phase number $t$.
  • Figure 4: The Overview of System Architecture.
  • Figure 5: Offline experiment results for adaptive-$\lambda$ and $\lambda$ correction on multiple Deep Q-Network models. Agents are evaluated every 5,000 steps, and averaged over 5 seeds.
  • ...and 2 more figures
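
The structure described in the Figure 3 caption (a separate network per phase, a constraint layer that calibrates Q-values, and a selection unit indexed by the phase number $t$) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the linear heads stand in for real networks, the constraint layer is assumed to subtract $\lambda_t$ times the action cost, and every name and number below is hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def make_linear_head(weights: List[List[float]]) -> Callable[[Vector], List[float]]:
    """Stand-in for a per-phase network: one linear output per action."""
    def head(state: Vector) -> List[float]:
        return [sum(w * x for w, x in zip(row, state)) for row in weights]
    return head


def constraint_layer(q: List[float], costs: List[float], lam: float) -> List[float]:
    """Calibrate raw Q-values by subtracting lam * cost (assumed form)."""
    return [qv - lam * c for qv, c in zip(q, costs)]


def q_network(state, t, heads, action_costs, lambdas):
    """Compute calibrated q-logits for every phase, then select phase t."""
    per_phase = [
        constraint_layer(head(state), costs, lam)
        for head, costs, lam in zip(heads, action_costs, lambdas)
    ]
    return per_phase[t]  # selection unit: keep only the requested phase


# Hypothetical three-phase setup with different action-space sizes.
heads = [
    make_linear_head([[1.0, 0.0], [0.0, 1.0]]),              # phase 0: 2 actions
    make_linear_head([[0.5, 0.5], [1.0, 1.0], [2.0, 0.0]]),  # phase 1: 3 actions
    make_linear_head([[1.0, 1.0], [3.0, 0.0]]),              # phase 2: 2 actions
]
action_costs = [[1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0]]
lambdas = [0.0, 0.5, 1.0]

state = [1.0, 2.0]
q_logits = q_network(state, t=1, heads=heads,
                     action_costs=action_costs, lambdas=lambdas)
best_action = max(range(len(q_logits)), key=q_logits.__getitem__)
```

Keeping one head per phase lets each phase have its own action space (for example, a truncation length in one phase and a model choice in another), while the selection unit exposes a single interface keyed by $t$.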

Theorems & Definitions (2)

  • Lemma 1
  • Proof