UOEP: User-Oriented Exploration Policy for Enhancing Long-Term User Experiences in Recommender Systems

Changshuo Zhang, Sirui Chen, Xiao Zhang, Sunhao Dai, Weijie Yu, Jun Xu

TL;DR

UOEP tackles the challenge of optimizing long-term user experiences in large-scale recommender systems by modeling the return distribution $\mathcal{Z}^\pi(s,a)$ and optimizing tail risk via $\operatorname{CVaR}_\alpha$. It introduces a population of $m$ actors, each targeting a different $\operatorname{CVaR}_{\alpha_i}$ under a distributional critic implemented with an Implicit Quantile Network, enabling fine-grained, user-group-specific exploration. To sustain learning quality, UOEP adds a population diversity regularizer and a stability supervision module, with their relative emphasis balanced by a Thompson-sampling bandit. Empirical results on three public/industrial datasets show consistent gains in long-term metrics and notable improvements for low-activity users, accompanied by reduced fairness gaps, demonstrating the method's practical impact in enhancing enduring user satisfaction.
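
As a rough illustration of the risk measure named above, the sketch below estimates a lower-tail $\operatorname{CVaR}_\alpha$ by averaging quantile values at levels $\tau \sim U(0, \alpha)$, which is one common way to read CVaR off a quantile-based critic. It is not the authors' implementation: the Gaussian `toy_return` stands in for a learned Implicit Quantile Network, and the function name, the sample count, and the five $\alpha_i$ values are assumptions chosen for the example.

```python
import numpy as np
from statistics import NormalDist


def cvar_from_quantile_fn(quantile_fn, alpha, n_samples=4096, seed=0):
    """Monte Carlo estimate of the lower-tail CVaR at level alpha.

    Uses CVaR_alpha(Z) = (1 / alpha) * integral_0^alpha F_Z^{-1}(tau) d tau,
    approximated by averaging F_Z^{-1}(tau) over tau ~ Uniform(0, alpha).
    """
    rng = np.random.default_rng(seed)
    taus = rng.uniform(0.0, alpha, size=n_samples)
    # Keep tau strictly inside (0, 1) so the inverse CDF is defined.
    taus = np.clip(taus, 1e-6, 1.0 - 1e-6)
    return float(np.mean([quantile_fn(t) for t in taus]))


# Assumption: a Gaussian return distribution plays the role of the critic's
# quantile function Z(s, a; tau) for one fixed state-action pair.
toy_return = NormalDist(mu=10.0, sigma=2.0)

# Assumption: one risk level per actor in a population of m = 5.
alphas = [0.2, 0.4, 0.6, 0.8, 1.0]
cvars = [cvar_from_quantile_fn(toy_return.inv_cdf, a) for a in alphas]

for a, c in zip(alphas, cvars):
    print(f"alpha={a:.1f}  CVaR estimate={c:.3f}")
```

A smaller $\alpha$ concentrates on the worst outcomes, so the estimates rise toward the plain mean of the return as $\alpha$ approaches 1.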

Abstract

Reinforcement learning (RL) has gained traction for enhancing users' long-term experiences in recommender systems by effectively exploring users' interests. However, modern recommender systems exhibit distinct user behavioral patterns across tens of millions of items, which increases the difficulty of exploration. For example, users with different activity levels require different intensities of exploration, yet previous studies often overlook this and apply a uniform exploration strategy to all users, which ultimately hurts user experiences in the long run. To address these challenges, we propose User-Oriented Exploration Policy (UOEP), a novel approach that facilitates fine-grained exploration among user groups. We first construct a distributional critic that allows policy optimization under varying quantile levels of cumulative reward feedback from users, representing user groups with varying activity levels. Guided by this critic, we devise a population of distinct actors, each aimed at effective and fine-grained exploration within its respective user group. To simultaneously enhance diversity and stability during exploration, we further introduce a population-level diversity regularization term and a supervision module. Experimental results on public recommendation datasets demonstrate that our approach outperforms all baselines in terms of long-term performance, validating the effectiveness of its user-oriented exploration. Further analyses show that our approach also improves performance for low-activity users and increases fairness among users.
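
The TL;DR states that a Thompson-sampling bandit balances the diversity regularizer against the supervision module, but this page does not say what the arms or rewards are. The sketch below shows only the generic Beta-Bernoulli form of Thompson sampling. The arms (`candidate_weights`), the binary reward, and the simulated success probabilities are all hypothetical and are not taken from the paper.

```python
import numpy as np


class BetaThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a small set of arms."""

    def __init__(self, n_arms, seed=0):
        self.successes = np.ones(n_arms)  # Beta(1, 1) prior for every arm
        self.failures = np.ones(n_arms)
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Draw one plausible success rate per arm and play the best draw.
        draws = self.rng.beta(self.successes, self.failures)
        return int(np.argmax(draws))

    def update(self, arm, reward):
        # Binary reward: 1 counts as a success, 0 as a failure.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Hypothetical arms: candidate weights on the diversity term.
candidate_weights = [0.0, 0.5, 1.0]
# Made-up simulation: chance that training "improves" under each weight.
true_success_prob = [0.2, 0.5, 0.8]

sampler = BetaThompsonSampler(len(candidate_weights))
env_rng = np.random.default_rng(1)
pulls = np.zeros(len(candidate_weights), dtype=int)

for _ in range(2000):
    arm = sampler.select()
    reward = env_rng.random() < true_success_prob[arm]
    sampler.update(arm, reward)
    pulls[arm] += 1

print(dict(zip(candidate_weights, pulls.tolist())))
```

In this toy setup the sampler shifts most of its pulls to the arm with the highest simulated success rate while still occasionally trying the others.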

Paper Structure

This paper contains 48 sections, 16 equations, 8 figures, 9 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Illustration Experiment. We sorted users by their activity levels (i.e., CTR) and selected five user groups, each consisting of the bottom $\alpha$-quantile of users ($\alpha\in\{0.2,0.4,0.6,0.8,1.0\}$). We then trained two RL algorithms, DDPG and TD3, on these five groups under four noise levels. All experiments were run with four different seeds, and we report the average results. For better presentation, we shifted the values within each group so that the minimum value within each group becomes 0.1. Adding the value in parentheses on the horizontal axis yields the actual return.
  • Figure 2: The proposed approach UOEP. It includes a population of $m$ actors, where $m$ is the population size and each actor$_i$ outputs an action $a_i$ based on the current user state $s$. The action $a_i$ along with state $s$ is fed into the distributional critic. Afterward, utilizing both its quantile value $\alpha_i$ and the critic's output $Z(s, a_i;\cdot)$, actor$_i$ computes the conditional value at risk (CVaR) measure in order to derive its policy gradients.
  • Figure 3: Learning curves for 5 actors of UOEP, HAC, and Wolpertinger on three datasets.
  • Figure 4: The t-SNE visualization of the population.
  • Figure 5: Ablations for the number of actors (denoted by $m$) in UOEP on KuaiRand.
  • ...and 3 more figures
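
The grouping and value shift described in the Figure 1 caption can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' preprocessing: the function names, the uniform random CTR values, and the example returns are assumptions. The sketch reads the number in parentheses on the horizontal axis as the offset removed from each group, which is an interpretation of the caption.

```python
import numpy as np


def bottom_quantile_groups(ctr, alphas):
    """Map each alpha to the indices of the bottom alpha-fraction of users by CTR."""
    ctr = np.asarray(ctr, dtype=float)
    order = np.argsort(ctr, kind="stable")  # least active users first
    n = len(ctr)
    return {a: order[: max(1, int(round(a * n)))] for a in alphas}


def shift_min_to(values, target_min=0.1):
    """Shift values so their minimum equals target_min; also return the offset."""
    values = np.asarray(values, dtype=float)
    offset = values.min() - target_min
    return values - offset, offset


# Synthetic per-user CTR values (assumption, for illustration only).
rng = np.random.default_rng(0)
ctr = rng.uniform(0.0, 1.0, size=1000)
groups = bottom_quantile_groups(ctr, [0.2, 0.4, 0.6, 0.8, 1.0])

# Made-up returns for one group under four noise levels.
shifted, offset = shift_min_to([12.3, 12.9, 13.4, 12.6])
print(shifted, offset)  # shifted + offset recovers the original returns
```

Because every group is a prefix of the same sorted order, the groups are nested: the $\alpha = 0.2$ group is contained in the $\alpha = 0.4$ group, and the $\alpha = 1.0$ group covers all users.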