Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions

Yinglun Xu; Tarun Suresh; Rohan Gumaste; David Zhu; Ruirui Li; Zhengyang Wang; Haoming Jiang; Xianfeng Tang; Qingyu Yin; Monica Xiao Cheng; Qi Zeng; Chao Zhang; Gagandeep Singh

TL;DR

PRC (preference-based reinforcement learning with constrained actions) is a novel two-step learning method that limits the reinforcement learning agent to optimizing over a constrained action space that excludes out-of-distribution state-actions.

Abstract

Preference-based reinforcement learning (PBRL) in the offline setting has achieved great success in industrial applications such as chatbots. A two-step learning framework, in which a reinforcement learning step follows a reward modeling step, has been widely adopted for the problem. However, such a method faces two challenges: the risk of reward hacking and the complexity of reinforcement learning. Our insight is that both challenges come from the state-actions not supported in the dataset. The learned reward is unreliable on such state-actions, and including them increases the complexity of the reinforcement learning problem at the second step. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to limit the reinforcement learning agent to optimize over a constrained action space that excludes the out-of-distribution state-actions. We empirically verify that our method has high learning efficiency on various datasets in robotic control environments.
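The high-level idea of restricting optimization to dataset-supported state-actions can be illustrated with a toy sketch. This is not the paper's PRC algorithm: it assumes a tabular, one-step setting with a greedy policy, and the states, actions, reward values, and the helper names `supported_actions` and `greedy_policy` are all hypothetical. It only shows how a spuriously high learned reward on an unsupported state-action can be exploited, and how a constrained action set avoids that.

```python
from collections import defaultdict

# Hypothetical toy data: (state, action) pairs observed in the offline dataset.
dataset = [("s0", "a0"), ("s0", "a1"), ("s1", "a1"), ("s1", "a2")]
all_actions = ["a0", "a1", "a2"]

# Hypothetical learned reward model. ("s0", "a2") and ("s1", "a0") never
# appear in the dataset, so their estimates are unreliable; here they are
# assumed to be spuriously high.
learned_reward = {
    ("s0", "a0"): 0.2, ("s0", "a1"): 0.7, ("s0", "a2"): 5.0,
    ("s1", "a0"): 4.0, ("s1", "a1"): 0.1, ("s1", "a2"): 0.6,
}


def supported_actions(pairs):
    """Map each state to the set of actions the dataset contains for it."""
    support = defaultdict(set)
    for state, action in pairs:
        support[state].add(action)
    return dict(support)


def greedy_policy(reward, candidates):
    """Pick, per state, the candidate action with the highest learned reward."""
    return {
        state: max(sorted(actions), key=lambda a: reward[(state, a)])
        for state, actions in candidates.items()
    }


support = supported_actions(dataset)

# Unconstrained: every action is a candidate, including unsupported ones.
unconstrained = greedy_policy(learned_reward, {s: all_actions for s in support})

# Constrained: only actions that the dataset supports are candidates.
constrained = greedy_policy(learned_reward, support)

print("unconstrained:", unconstrained)
print("constrained:  ", constrained)
```

In this sketch the unconstrained policy selects the unsupported actions in both states, while the constrained policy selects only actions seen in the dataset. The constrained search is also smaller (two candidates per state instead of three), which loosely mirrors the claim that excluding unsupported state-actions reduces the complexity of the second step.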

Paper Structure (17 sections, 4 equations, 3 figures, 1 table, 3 algorithms)

Introduction
Related Work
Preliminary
Offline Preference Reinforcement Learning Problem
Traditional PPO Learning with KL-regularization
Preference Based Reinforcement Learning on a Constrained Action Space (PRC)
General PRC Algorithm
Analysis
Practical Implementation
Experiments
Setup
Learning Efficiency Evaluation
Pessimism Effectiveness
Reinforcement Learning Efficiency on Constrained Action Space
Reinforcement Learning Complexity in PBRL
...and 2 more sections

Figures (3)

Figure 1: Performance trends of the learned policies during training, evaluated on the learned reward models (simulated performance) and on the true reward models (true performance). An algorithm is not pessimistic enough if the two trends are not aligned.
Figure 2: Performance of the learned policies on the learned reward models during training. The reinforcement learning complexity of a setting is lower when the simulated performance is higher.
Figure 3: For each dataset, a pair of reward models is trained, one on the true rewards and one on the preference signals. The same RL algorithm is applied to both reward models. Learning on a reward model is less difficult if the PPO algorithm can learn better policies according to that reward model.
