SUBER: An RL Environment with Simulated Human Behavior for Recommender Systems

Nathan Corecco; Giorgio Piatti; Luca A. Lanzendörfer; Flint Xiaofeng Fan; Roger Wattenhofer

TL;DR

SUBER introduces a versatile RL environment for recommender systems by using Large Language Models (LLMs) to simulate human user behavior within a modular gym-like framework. The environment combines memory, pre-processing, and post-processing modules to generate observations, prompts, and rewards, enabling on-policy RL training without real user data. Through extensive ablations across movie and book domains, the study analyzes prompting strategies, retrieval methods, LLM sizes, and reward perturbation/shaping, demonstrating that similarity-based retrieval and 2-shot prompting with a custom system prompt yield high fidelity to human preferences. The framework provides both synthetic data generation and evaluation capabilities, offering a practical path to train and benchmark RL-based recommenders when real online data are scarce; code is openly available for replication and extension.

Abstract

Reinforcement learning (RL) has gained popularity in the realm of recommender systems due to its ability to optimize long-term rewards and guide users in discovering relevant content. However, the successful implementation of RL in recommender systems is challenging because of several factors, including the limited availability of online data for training on-policy methods. This scarcity requires expensive human interaction for online model training. Furthermore, the development of effective evaluation frameworks that accurately reflect the quality of models remains a fundamental challenge in recommender systems. To address these challenges, we propose a comprehensive framework for synthetic environments that simulate human behavior by harnessing the capabilities of large language models (LLMs). We complement our framework with in-depth ablation studies and demonstrate its effectiveness with experiments on movie and book recommendations. Using LLMs as synthetic users, this work introduces a modular and novel framework to train RL-based recommender systems. The software, including the RL environment, is publicly available on GitHub.

Paper Structure (54 sections, 11 equations, 11 figures, 21 tables)

Related Work
RL for Recommender Systems.
Large Language Models.
Framework
Memory
Pre-processing
Item Retrieval.
Prompting.
Postprocessing
Reward Perturbation.
Reward Shaping.
Experiments
Setup
Ablations
Genres/Categories.
...and 39 more sections

Figures (11)

Figure 1: Overview of SUBER. The environment is built as a modular framework where each component can be modified as required. The basic control flow is as follows: The environment provides an observation using the memory module; the RL model returns an item recommendation in the form of an action, which is processed into a prompt by the memory and preprocessing component before being passed to the LLM. The score returned by the LLM is postprocessed, stored in memory and returned as a reward to the RL model.
Figure 2: Pipeline of one interaction between the RL model and SUBER. The environment provides an observation in the form of a user description and user-item interaction history to the RL model. The RL model then recommends an item, which is processed into a prompt together with the user description and interaction history. The LLM uses this prompt to generate a reward for the recommended item. The reward is stored as part of the user-item interaction history and returned to the RL model.
Figure 3: Aggregated score across LLM families for the movie environment (top) and for the book environment (bottom), varying only the LLM component. For details, see the extended ablations appendix.
Figure 4: Training plot of various RL models. The y-axis displays the average reward from evaluation samples.
Figure 5: Genre preferences of users generated via an LLM. For each movie genre, we show in blue the percentage of generated users who like the genre and in red the percentage of generated users who do not like it.
...and 6 more figures
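
The control flow described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SUBER implementation: the class and function names (`Memory`, `build_prompt`, `stub_llm`, `postprocess`, `ToyRecEnv`), the prompt wording, and the 1 to 10 rating range are all hypothetical, and the LLM is replaced by a deterministic stub so the example runs without any model.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Per-user interaction history (hypothetical structure)."""
    history: dict = field(default_factory=dict)

    def get(self, user_id):
        # Return a copy so callers cannot mutate stored history.
        return list(self.history.get(user_id, []))

    def add(self, user_id, item, rating):
        self.history.setdefault(user_id, []).append((item, rating))


def build_prompt(user_desc, past, item):
    """Pre-processing: combine user description, history, and item into a prompt."""
    lines = [f"User: {user_desc}"]
    lines += [f"Rated '{i}' {r}/10" for i, r in past]
    lines.append(f"How would this user rate '{item}' from 1 to 10?")
    return "\n".join(lines)


def stub_llm(prompt):
    """Stand-in for a real LLM call (assumption): returns a rating as text."""
    return str(random.Random(len(prompt)).randint(1, 10))


def postprocess(raw):
    """Post-processing: parse the LLM output and clamp it to the assumed range."""
    return float(min(10, max(1, int(raw.strip()))))


class ToyRecEnv:
    """Gym-like toy environment: reset() gives an observation, step() a reward."""

    def __init__(self, users, items, llm=stub_llm):
        self.users, self.items, self.llm = users, items, llm
        self.memory = Memory()
        self.user_id = None

    def _observation(self):
        # Observation = user description plus user-item interaction history.
        return {"user": self.users[self.user_id],
                "history": self.memory.get(self.user_id)}

    def reset(self, user_id=0):
        self.user_id = user_id
        return self._observation()

    def step(self, action):
        # The action is the index of the recommended item.
        item = self.items[action]
        prompt = build_prompt(self.users[self.user_id],
                              self.memory.get(self.user_id), item)
        reward = postprocess(self.llm(prompt))
        # The reward is stored in memory and returned to the RL model.
        self.memory.add(self.user_id, item, reward)
        return self._observation(), reward
```

A recommender policy would call `env.reset(user_id)` to obtain the user description and history, then repeatedly call `env.step(action)` with an item index. Swapping `stub_llm` for a real model call, or `postprocess` for a variant that perturbs or shapes the reward, would leave the rest of the loop unchanged, which is the kind of modularity the captions describe.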
