Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling

Jie Wang, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose

TL;DR

This paper proposes LE Augmentation (LEA), a method that uses a large language model adapted as an environment (LE) to improve recommendation performance by jointly optimising the supervised component and the RL policy with augmented actions and historical user signals, and reports experimental results on two publicly available datasets.

Abstract

Reinforcement Learning (RL)-based recommender systems have demonstrated promising performance in meeting user expectations by learning to make accurate next-item recommendations from historical user-item interactions. However, existing offline RL-based sequential recommendation methods struggle to obtain effective user feedback from the environment, so effectively modeling the user state and shaping an appropriate reward for recommendation remain open problems. In this paper, we leverage language understanding capabilities and adapt large language models (LLMs) as an environment (LE) to enhance RL-based recommenders. The LE is learned from a subset of user-item interaction data, thus reducing the need for large training data, and can synthesise user feedback for offline data by: (i) acting as a state model that produces high-quality states that enrich the user representation, and (ii) functioning as a reward model to accurately capture nuanced user preferences on actions. Moreover, the LE can generate positive actions that augment the limited offline training data. We propose an LE Augmentation (LEA) method to further improve recommendation performance by jointly optimising the supervised component and the RL policy, using the augmented actions and historical user signals. We use LEA and the state and reward models in conjunction with state-of-the-art RL recommenders and report experimental results on two publicly available datasets.
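
The joint optimisation described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the softmax cross-entropy, the one-step TD error, the discount `gamma`, and the names `w_sup` and `w_q` are assumptions chosen for illustration. It only shows how a supervised loss and a Q-loss on the logged next item could be combined with weighted counterparts computed on an LE-generated positive action.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def td_error_sq(q_values, action, reward, next_q_values, gamma):
    """Squared one-step TD error for a single transition (assumed Q-loss form)."""
    target = reward + gamma * max(next_q_values)
    return (target - q_values[action]) ** 2

def joint_loss(logits, q_values, next_q_values, next_item, aug_item,
               reward, aug_reward, w_sup=1.0, w_q=1.0, gamma=0.5):
    """Combine original and augmented losses; all weights are hypothetical."""
    # Original terms: supervised loss and Q-loss on the logged next item.
    base = (cross_entropy(logits, next_item)
            + td_error_sq(q_values, next_item, reward, next_q_values, gamma))
    # Augmented terms: the same losses on the LE-generated positive action.
    aug = (w_sup * cross_entropy(logits, aug_item)
           + w_q * td_error_sq(q_values, aug_item, aug_reward, next_q_values, gamma))
    return base + aug

# Toy example with two items: item 0 is the logged next item,
# item 1 stands in for an action proposed by the LE.
loss = joint_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0],
                  next_item=0, aug_item=1, reward=1.0, aug_reward=1.0)
print(loss)
```

Setting `w_sup` and `w_q` to zero recovers the loss without augmentation, which makes it easy to compare the two settings on the same batch.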

Paper Structure (28 sections, 14 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Self-supervised Reinforcement Learning for Recommendation (SSRL4R). (a) shows the previous offline structure, where the state for the RL agent is the hidden state from the sequential model, and the reward value is a predefined scalar. (b) shows our proposed structure, where the state is generated from a separate state model, and the reward is from a reward model.
  • Figure 2: Our approach of adapting a decoder-only LLM as Environment (LE). (a) We produce a token $i^e_i \in I^e$ for each item $i_i\in I$ by optimizing the objective of autoregressively generating the next tokens of its textual content. (b) We learn the LE with parameter-efficient adapters $\phi$ on a small subset of user data. The user-item token interactions $x^e_{1:t}$, where $x^e_i\in I^e$, are the input used to generate the state representation $s^e_t$. We enhance the state representation by comparing the similarity between the state and actions through the loss $\mathcal{L}_{sm}$. The reward prompt $p_t, p_{a_t}$ contains $x^e_{1:t}$ and an action $a_t \in [a^+_t, a^-_t]$, where $a^+_t$ is the positive action (the next interacted item) and $a^-_t$ is the negative action (a sampled uninteracted item). The action-specific reward for a user is produced by a score head $\theta$, and the LLM is trained by comparing user preferences for actions via the loss $\mathcal{L}_{rm}$.
  • Figure 3: Structure of LEA. Left: the LE is applied to offline data. $(x_1^{(e)}, x_2^{(e)}, \dots, x_t^{(e)})$ denotes the user-item interaction $x_{1:t}$ for the sequential model, where $x_i \in I$, and $x_{1:t}^e$ denotes the user-item token interaction for the LE, where $x^e_i \in I^e$. $a_t^e$ is the positive action predicted by the LE. Right: the RL policy is trained via the original Q-loss and the augmented Q-loss $\mathcal{L}_{aq}$; the base sequential model is jointly trained through the RL loss, the supervised loss over the original next item, and the augmented supervised loss $\mathcal{L}_{ah}$ over $a_t^e$.
  • Figure 4: LEASR with different weights of (a) supervised learning and (b) Q-learning augmentation loss on the LFM dataset.
  • Figure 5: Effect of scaling the training data for the LE, shown for LEASR on the LFM dataset.
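
The Figure 2 caption says the reward model is trained by comparing user preferences for a positive action $a^+_t$ and a negative action $a^-_t$ via $\mathcal{L}_{rm}$, but the exact form of that loss is not given here. The sketch below assumes a standard pairwise logistic (Bradley-Terry style) objective, $-\log \sigma(r^+ - r^-)$, purely as an illustration; the function names and the scalar scores are hypothetical stand-ins for the outputs of the score head.

```python
import math

def pairwise_reward_loss(r_pos, r_neg):
    """Assumed pairwise loss: -log(sigmoid(r_pos - r_neg)).

    Computed stably as softplus(r_neg - r_pos).
    """
    d = r_neg - r_pos
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))

def batch_reward_loss(pairs):
    """Mean pairwise loss over (positive score, negative score) pairs."""
    return sum(pairwise_reward_loss(p, n) for p, n in pairs) / len(pairs)

# Hypothetical scores for two users: the first pair is ranked correctly,
# the second is ranked incorrectly and therefore contributes a larger loss.
scores = [(2.0, 0.0), (0.0, 2.0)]
print(batch_reward_loss(scores))
```

Under this assumed objective the loss shrinks as the score of the next interacted item rises above that of the sampled uninteracted item, which matches the comparison described in the caption.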