AURO: Reinforcement Learning for Adaptive User Retention Optimization in Recommender Systems
Zhenghai Xue, Qingpeng Cai, Bin Yang, Lantao Hu, Peng Jiang, Kun Gai, Bo An
TL;DR
AURO tackles non-stationary user behavior in reinforcement-learning–driven recommendations by introducing a value-aligned state abstraction module and a guarded online exploration mechanism. The state abstraction, trained with a value-based loss, provides a universal signal of environment shifts and feeds into an actor–critic policy to adapt dynamically, while guarded exploration uses an optimistic Q-estimate with rejection sampling to maintain safe online learning. Empirical results across a retention simulator, MovieLens, and a live platform show AURO consistently outperforms baselines in stability, adaptation, and key retention metrics. The work advances practical RL for recommender systems by addressing distribution shifts and implicit cold starts, enabling more robust long-term user retention in dynamic environments.
Abstract
Reinforcement Learning (RL) has garnered increasing attention for its ability to optimize user retention in recommender systems. A primary obstacle in this optimization process is environment non-stationarity, which stems from the continual and complex evolution of user behavior patterns over time, such as variations in interaction rates and retention propensities. These changes pose significant challenges to existing RL algorithms for recommendation, leading to shifts in both the dynamics and the reward distribution. This paper introduces a novel approach called Adaptive User Retention Optimization (AURO) to address this challenge. To guide the recommendation policy in non-stationary environments, AURO introduces a state abstraction module in the policy network. The module is trained with a new value-based loss function that aligns its output with the estimated performance of the current policy. Because the performance of an RL policy is sensitive to environment drift, this loss function makes the state abstraction reflect environment changes and signal the recommendation policy to adapt accordingly. Additionally, the non-stationarity of the environment introduces the problem of implicit cold start, where the recommendation policy continuously interacts with users displaying novel behavior patterns. AURO encourages exploration guarded by performance-based rejection sampling to maintain stable recommendation quality in the cost-sensitive online environment. Extensive empirical analyses are conducted in a user retention simulator, on the MovieLens dataset, and on a live short-video recommendation platform, demonstrating AURO's superior performance over all evaluated baseline algorithms.
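The idea of exploration guarded by performance-based rejection sampling can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `guarded_explore`, the `propose` perturbation, the `margin` threshold, and the acceptance rule (accept a candidate only if its estimated value is within `margin` of the policy action's value) are all assumptions made for illustration, and the abstract does not specify the exact criterion AURO uses.

```python
import random


def guarded_explore(policy_action, q_value, propose, margin=0.1, max_tries=10, rng=None):
    """Hypothetical guarded exploration via rejection sampling.

    policy_action: action chosen by the current policy.
    q_value:       callable mapping an action to an estimated value.
    propose:       callable (action, rng) -> exploratory candidate action.
    margin:        assumed tolerance on the estimated value drop.
    """
    if rng is None:
        rng = random.Random()
    baseline = q_value(policy_action)
    for _ in range(max_tries):
        candidate = propose(policy_action, rng)
        # Reject candidates whose estimated value falls too far below the baseline.
        if q_value(candidate) >= baseline - margin:
            return candidate
    # No candidate passed the guard: fall back to the policy's own action.
    return policy_action


if __name__ == "__main__":
    # Toy value estimate that peaks at action 1.0 (purely illustrative).
    toy_q = lambda a: -abs(a - 1.0)
    gaussian_noise = lambda a, r: a + r.gauss(0.0, 0.5)
    print(guarded_explore(1.0, toy_q, gaussian_noise, rng=random.Random(0)))
```

Under these assumptions the returned action never has an estimated value more than `margin` below that of the policy action, because the fallback is the policy action itself. That bound is the sense in which exploration is "guarded" in this sketch.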
