AURO: Reinforcement Learning for Adaptive User Retention Optimization in Recommender Systems

Zhenghai Xue, Qingpeng Cai, Bin Yang, Lantao Hu, Peng Jiang, Kun Gai, Bo An

TL;DR

AURO tackles non-stationary user behavior in reinforcement-learning–driven recommendations by introducing a value-aligned state abstraction module and a guarded online exploration mechanism. The state abstraction, trained with a value-based loss, provides a universal signal of environment shifts and feeds into an actor–critic policy to adapt dynamically, while guarded exploration uses an optimistic Q-estimate with rejection sampling to maintain safe online learning. Empirical results across a retention simulator, MovieLens, and a live platform show AURO consistently outperforms baselines in stability, adaptation, and key retention metrics. The work advances practical RL for recommender systems by addressing distribution shifts and implicit cold starts, enabling more robust long-term user retention in dynamic environments.

Abstract

Reinforcement Learning (RL) has garnered increasing attention for its ability to optimize user retention in recommender systems. A primary obstacle in this optimization process is environment non-stationarity, which stems from the continual and complex evolution of user behavior patterns over time, such as variations in interaction rates and retention propensities. These changes pose significant challenges to existing RL algorithms for recommendation, causing shifts in both the dynamics and the reward distribution. This paper introduces a novel approach called Adaptive User Retention Optimization (AURO) to address this challenge. To guide the recommendation policy in non-stationary environments, AURO introduces a state abstraction module in the policy network. The module is trained with a new value-based loss function that aligns its output with the estimated performance of the current policy. Because the performance of an RL policy is sensitive to environment drift, this loss function makes the state abstraction reflect environment changes and signals the recommendation policy to adapt accordingly. Additionally, the non-stationarity of the environment introduces the problem of implicit cold start, where the recommendation policy continuously interacts with users displaying novel behavior patterns. AURO encourages exploration guarded by performance-based rejection sampling to maintain stable recommendation quality in the cost-sensitive online environment. Extensive empirical analyses are conducted in a user retention simulator, on the MovieLens dataset, and on a live short-video recommendation platform, demonstrating AURO's superior performance against all evaluated baseline algorithms.

Paper Structure (21 sections, 10 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 21 sections, 10 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: The procedure of optimizing user retention with RL.
  • Figure 2: Left: Normalized ratio of immediate user feedback among all data samples in the live environment, compared on different dates in one month and different hours in one day. Right: The distribution of user return time in three consecutive weeks. The return probabilities at different days exhibit variability over time.
  • Figure 3: Overview of the AURO framework. (a) The user state encoding module that embeds user features, item features, and the item history into a low-dimensional state vector. (b) The actor-critic module with a state abstraction network that generates the latent feature vector $\phi(s)$. The state vector is concatenated with $\phi(s)$ before serving as the input to the actor and critic networks. (c) The exploration module for selecting exploration actions that interact with the recommendation environment.
  • Figure 4: Left: Demonstration of action selection with optimism under uncertainty in Eq. \ref{eq:oac}. The action can miss the local optima of the state-action value function. Right: The state-action value function on two example state-action pairs in the user retention simulator \cite{zhao2023kuaisim}. The vertical lines show the relative values of the original and exploration actions in one dimension. The exploration action generated by Eq. \ref{eq:oac} with a fixed step size $\delta$ can lead to a lower value than the original action.
  • Figure 5: Performance comparison of different algorithms in the modified KuaiSim simulator. Metrics with the up arrow (↑) are expected to have larger values and vice versa.
  • ...and 1 more figure
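
The Figure 4 caption notes that an optimistic exploration action taken with a fixed step size $\delta$ can land on a lower value than the original action, and the abstract describes guarding exploration with performance-based rejection sampling. The toy sketch below illustrates that idea in one dimension. It is not the paper's implementation: the critic `q_value`, the helper names, the uniform sampling of step sizes, and the acceptance rule are all assumptions made for illustration, and the actual form of the equation referenced in the caption is not shown in this excerpt.

```python
import numpy as np


def q_value(action):
    """Hypothetical 1-D critic with a narrow optimum at action = 0.3."""
    return float(np.exp(-((action - 0.3) ** 2) / 0.02))


def q_grad(action, eps=1e-4):
    """Central finite-difference estimate of dQ/da for the toy critic."""
    return (q_value(action + eps) - q_value(action - eps)) / (2 * eps)


def optimistic_action(action, delta):
    """Unguarded exploration: one fixed-size step along the gradient sign."""
    return action + delta * np.sign(q_grad(action))


def guarded_exploration(action, delta, rng, n_candidates=16, tolerance=0.0):
    """Rejection sampling (assumed scheme): draw step sizes in [0, delta),
    return the first candidate whose estimated value is not worse than the
    original action's value minus `tolerance`, else keep the original action.
    """
    base = q_value(action)
    direction = np.sign(q_grad(action))
    for step in rng.uniform(0.0, delta, size=n_candidates):
        candidate = action + step * direction
        if q_value(candidate) >= base - tolerance:
            return candidate
    return action  # every candidate was rejected


rng = np.random.default_rng(0)
original = 0.25
naive = optimistic_action(original, delta=0.3)
guarded = guarded_exploration(original, delta=0.3, rng=rng)
print("original:", q_value(original))
print("fixed-step exploration:", q_value(naive))
print("guarded exploration:", q_value(guarded))
```

In this toy setting the fixed step overshoots the optimum, so the unguarded action has a lower estimated value than the original, while the guarded variant never returns an action whose estimated value falls below the original's. The `tolerance` parameter is a hypothetical knob for trading exploration against safety.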