ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems

Yi Zhang; Ruihong Qiu; Jiajun Liu; Sen Wang

TL;DR

ROLeR tackles two core limitations of model-based offline reinforcement learning for recommender systems: inaccurate reward estimation and reliance on world-model ensembles for uncertainty estimation. It introduces a non-parametric reward shaping mechanism based on kNN clustering of user indicators, together with a clustering-distance-based uncertainty penalty that reduces dependence on ensembles. The approach achieves state-of-the-art performance on four benchmark recommender-system datasets, improving both long-term cumulative reward and single-step feedback, and is robust to hyperparameter choices. By improving reward quality and penalizing risky actions without heavy ensemble methods, ROLeR offers a scalable, practical advance for offline recommender systems.
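
The two ideas in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: it assumes each user is described by a fixed-length indicator vector, that logged rewards for one item are available for a set of users, and that Euclidean distance is a reasonable similarity measure. The function name `knn_shaped_reward` and the parameters `k` and `penalty_weight` are hypothetical names chosen for this example. The reward for a query user is estimated as the mean logged reward of its `k` nearest logged users, and the mean distance to those neighbours serves as an uncertainty penalty.

```python
import numpy as np

def knn_shaped_reward(query, users, rewards, k=3, penalty_weight=0.1):
    """Illustrative kNN reward estimate with a distance-based penalty.

    query:   (d,) indicator vector of the user being evaluated (assumed form)
    users:   (n, d) indicator vectors of logged users
    rewards: (n,) logged rewards of those users for one item
    Returns (reward_estimate, uncertainty, penalized_reward).
    """
    # Euclidean distance from the query to every logged user.
    dists = np.linalg.norm(users - query, axis=1)
    # Indices of the k closest logged users.
    nearest = np.argsort(dists)[:k]
    # Non-parametric reward estimate: average reward of the neighbours.
    reward_estimate = rewards[nearest].mean()
    # Distance-based uncertainty: far neighbours mean less trustworthy estimate.
    uncertainty = dists[nearest].mean()
    penalized_reward = reward_estimate - penalty_weight * uncertainty
    return reward_estimate, uncertainty, penalized_reward

# Toy data, invented for illustration only.
users = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
rewards = np.array([1.0, 0.0, 1.0, 0.0])

est, unc, shaped = knn_shaped_reward(
    np.array([0.0, 0.0]), users, rewards, k=3, penalty_weight=0.5
)
far_est, far_unc, far_shaped = knn_shaped_reward(
    np.array([10.0, 10.0]), users, rewards, k=3, penalty_weight=0.5
)
```

In this toy setup, a query user that lies far from all logged users receives a larger uncertainty value and therefore a more heavily penalized reward, which is the intuition behind replacing an ensemble with a distance-based penalty. No model training is involved, only a neighbour lookup over the logged data.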

Abstract

Offline reinforcement learning (RL) is an effective tool for real-world recommender systems with its capacity to model the dynamic interest of users and its interactive nature. Most existing offline RL recommender systems focus on model-based RL through learning a world model from offline data and building the recommendation policy by interacting with this model. Although these methods have made progress in the recommendation performance, the effectiveness of model-based offline RL methods is often constrained by the accuracy of the estimation of the reward model and the model uncertainties, primarily due to the extreme discrepancy between offline logged data and real-world data in user interactions with online platforms. To fill this gap, a more accurate reward model and uncertainty estimation are needed for the model-based RL methods. In this paper, a novel model-based Reward Shaping in Offline Reinforcement Learning for Recommender Systems, ROLeR, is proposed for reward and uncertainty estimation in recommendation systems. Specifically, a non-parametric reward shaping method is designed to refine the reward model. In addition, a flexible and more representative uncertainty penalty is designed to fit the needs of recommendation systems. Extensive experiments conducted on four benchmark datasets showcase that ROLeR achieves state-of-the-art performance compared with existing baselines. The source code can be downloaded at https://github.com/ArronDZhang/ROLeR.

Paper Structure (31 sections, 1 theorem, 26 equations, 3 figures, 6 tables, 1 algorithm)

Introduction
Related Work
RL in Recommendation Systems
Offline Reinforcement Learning
Preliminaries
Interactive Recommendation
Reinforcement Learning Formulation
Offline Reinforcement Learning
Method
Problem Definition
World Model Learning
State Tracker
Action Representation
Policy Learning Pipeline
Reward Shaping
...and 16 more sections

Key Result

Theorem 1

For offline data of size $N$, let $\hat{Q}^*$ be its optimal value function with respect to its world model. If $\mathcal{T}\hat{Q}^*$ is $L$-smooth, then with probability at least $1 - \delta$,

Figures (3)

Figure 1: The reward estimation error of a world model and ROLeR across different intervals. Our training-free reward shaping consistently outperforms the current world model, reaching a higher relative cumulative reward.
Figure 2: The overall performance on KuaiRand and KuaiRec.
Figure 3: Robustness w.r.t. different $k$s on four datasets.

Theorems & Definitions (1)

Theorem 1
