MTRec: Learning to Align with User Preferences via Mental Reward Models

Mengchen Zhao, Yifan Gao, Yaqing Hou, Xiangyang Li, Pengjie Gu, Zhenhua Dong, Ruiming Tang, Yi Cai

TL;DR

MTRec tackles the misalignment between implicit feedback and real user preferences in sequential recommendation. It introduces a mental reward model learned from user behavior using a distributional IRL approach, QR-IQL, to capture stochastic satisfaction signals and guide existing recommender systems. The method yields consistent improvements on public datasets and RL-based platforms, and a real-world deployment reports a 7% uplift in average viewing time. Overall, MTRec demonstrates how explicit modeling of private user satisfaction can align recommendations with long-term user welfare and engagement, both offline and online.

Abstract

Recommendation models are predominantly trained using implicit user feedback, since explicit feedback is often costly to obtain. However, implicit feedback, such as clicks, does not always reflect users' real preferences. For example, a user might click on a news article because of its attractive headline, but end up feeling uncomfortable after reading the content. In the absence of explicit feedback, such erroneous implicit signals may severely mislead recommender systems. In this paper, we propose MTRec, a novel sequential recommendation framework designed to align with real user preferences by uncovering users' internal satisfaction with recommended items. Specifically, we introduce a mental reward model to quantify user satisfaction and propose a distributional inverse reinforcement learning approach to learn it. The learned mental reward model is then used to guide recommendation models to better align with users' real preferences. Our experiments show that MTRec brings significant improvements to a variety of recommendation models. We also deploy MTRec on an industrial short video platform and observe a 7 percent increase in average user viewing time.
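
As a rough illustration of what "distributional" learning can mean here, the sketch below fits a quantile of a set of scalar satisfaction signals by minimizing the quantile (pinball) loss. This is not the authors' QR-IQL method: this page does not expand the name, and the sketch assumes that "QR" refers to quantile regression, a common building block of distributional reinforcement learning. The function names pinball_loss and fit_quantile, the learning rate, and the epoch count are hypothetical choices made for this example.

```python
def pinball_loss(prediction, target, tau):
    """Quantile (pinball) loss of one prediction at quantile level tau in (0, 1)."""
    diff = target - prediction
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def fit_quantile(samples, tau, lr=1.0, epochs=2000):
    """Estimate the tau-quantile of samples by subgradient descent on the pinball loss.

    Hypothetical helper for illustration only; it is not part of MTRec.
    """
    q = sum(samples) / len(samples)  # start from the sample mean
    for _ in range(epochs):
        grad = 0.0
        for s in samples:
            # Subgradient of the pinball loss with respect to q.
            grad += -tau if s > q else (1.0 - tau)
        q -= lr * grad / len(samples)
    return q


if __name__ == "__main__":
    rewards = [float(i) for i in range(1, 101)]
    print(fit_quantile(rewards, tau=0.5))
    print(fit_quantile(rewards, tau=0.9))
```

Fitting several quantile levels at once gives a coarse picture of the whole reward distribution rather than only its mean, which is the general motivation for modeling a stochastic satisfaction signal distributionally.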

Paper Structure

This paper contains 21 sections, 21 equations, 4 figures, 2 tables, and 2 algorithms.

Figures (4)

  • Figure 1: The overall framework of MTRec. The solid lines represent the interaction process. The dashed lines represent the information flow between data and models. Our goal is to recover the mental reward model and use it to improve the recommendation model.
  • Figure 2: Training curves of RL models. Average CTR is reported with a 95% confidence interval.
  • Figure 3: Illustrations of the predicted mental rewards. (a) Averaged mental rewards by steps in all trajectories; (b-e) Expected and counterfactual mental rewards given actual user actions.
  • Figure 4: Online A/B test results.