
An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

Shuang Feng, Grace Feng

TL;DR

This work investigates data-efficient reinforcement learning for web-based recommender systems by integrating large language models with RL methods. It compares Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) within the WebShop benchmark, exploring semi-generative training from human trajectories and self-learning with generated trajectories. The results show that DPO can achieve higher task performance and success rates than PPO in under 1 hour of training, with generated trajectories offering comparable gains to human data and reducing data collection costs. The findings suggest that generative data and preference-based learning can substantially improve data efficiency for RL-driven recommenders, with practical implications for production systems.

Abstract

Recent advances in large language models (LLMs) have made it possible to understand webpage contexts, product details, and human instructions. Using LLMs as the foundational architecture for reward models or policies in reinforcement learning (RL) has gained popularity; a notable example is InstructGPT. In industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases, RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO), as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates RL agents trained on generated trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments show that agents trained on generated trajectories achieved task performance comparable to that of agents trained on human trajectories, demonstrating an extremely low-cost, data-efficient way of training RL agents. In addition, with limited training time (under 2 hours) and without using any images, a DPO agent achieved a 19% success rate after approximately 3000 steps, or 30 minutes, of training on T4 GPUs, compared with a PPO agent, which reached a 15% success rate.
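
To make "learning from preferences without a reward model" concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' implementation. It assumes that each trajectory has already been scored with a summed log-probability under both the policy being trained and a frozen reference policy; the function names, argument names, and the value of `beta` are hypothetical choices made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; controls how strongly the policy may drift from the reference
) -> float:
    """DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is assumed to be the summed log-probability of a whole
    trajectory under the trained policy or the frozen reference policy.
    """
    # How much more the trained policy prefers the chosen trajectory.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # How much more the reference policy already preferred it.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # The loss shrinks as the policy's margin grows beyond the reference's.
    return -log_sigmoid(beta * (policy_margin - ref_margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy prefers the chosen trajectory more than the reference does: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Because the reference margin is subtracted inside the sigmoid, no separate reward model is trained; the preference signal enters only through which trajectory in each pair is labeled as chosen.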

Paper Structure (13 sections, 7 equations, 6 figures)

Figures (6)

  • Figure 1: WebShop Environment [webshop]
  • Figure 2: WebShop Human Instructions and Human Trajectories [webshop]
  • Figure 3: DPO vs. PPO --- Human Trajectories and Generated Trajectories --- Scores
  • Figure 4: DPO vs. PPO --- Human Trajectories and Generated Trajectories --- Success Rate
  • Figure 5: Self-learning Using Generated Trajectories --- Scores
  • ...and 1 more figures