Online Policy Learning from Offline Preferences

Guoxi Zhang; Han Bao; Hisashi Kashima

TL;DR

This paper introduces a PbRL framework that combines offline preferences with virtual preferences, which are comparisons between the agent's behaviors and the offline data. The virtual preferences let the reward function track the agent's behaviors, so it can offer well-aligned guidance to the agent.

Abstract

In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and virtual preferences for PbRL, which are comparisons between the agent's behaviors and the offline data. Critically, the reward function can track the agent's behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.
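To make the idea of fitting a reward function to preferences concrete, the following is a minimal sketch under stated assumptions, not the paper's method. It assumes a logistic (Bradley-Terry) preference model, which is a common choice in PbRL; the paper's outline refers to Thurstone's model, whose exact form is not given in this excerpt. The names `segment_return`, `preference_prob`, and `preference_loss` are hypothetical, and each preference is assumed to be a pair `(preferred_segment, other_segment)`. How the direction of a virtual preference is decided is not specified here, so the sketch treats all pairs uniformly.

```python
import math

def segment_return(reward_fn, segment):
    """Sum of predicted rewards over the elements of a trajectory segment."""
    return sum(reward_fn(x) for x in segment)

def preference_prob(reward_fn, seg_a, seg_b):
    """P(seg_a is preferred over seg_b) under an assumed logistic model."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    return 1.0 / (1.0 + math.exp(-diff))

def preference_loss(reward_fn, preferences):
    """Mean negative log-likelihood over (preferred, other) segment pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for preferred, other in preferences:
        total -= math.log(preference_prob(reward_fn, preferred, other) + eps)
    return total / len(preferences)

# Toy usage: segments are lists of scalar features (an assumption).
prefs = [([1.0, 2.0], [0.0, 0.5])]
consistent = lambda x: x      # ranks the preferred segment higher
inconsistent = lambda x: -x   # ranks it lower
loss_consistent = preference_loss(consistent, prefs)
loss_inconsistent = preference_loss(inconsistent, prefs)
```

A reward function that agrees with the recorded preferences attains the lower loss, which is the signal a reward learner would minimize.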

Paper Structure (36 sections, 3 theorems, 9 equations, 2 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Reinforcement Learning
MDP
RL Objective
Preference-based Reinforcement Learning
Thurstone's model
Preference-based Reward Learning
Preference-based Adversarial Imitation Learning
Problem Setup
Reward Learning from Offline Trajectories
Virtual Preferences
Handling Imperfect Data
Initialization
...and 21 more sections

Key Result

Corollary 3.2

$\sum_x \rho^\pi(x) = \frac{1}{1-\gamma}$ for any policy $\pi$.
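The corollary can be checked numerically. The sketch below assumes that $\rho^\pi$ is the unnormalized discounted occupancy measure, $\rho^\pi(x) = \sum_{t \ge 0} \gamma^t \Pr(x_t = x)$, and that $x$ ranges over state-action pairs; neither is defined in this excerpt. Under that assumption the distribution at each time step sums to 1, so the total is $\sum_t \gamma^t = 1/(1-\gamma)$. The MDP, policy, and helper names are hypothetical and randomly generated.

```python
import random

def occupancy_sum(P, pi, mu0, gamma, horizon=2000):
    """Truncated sum over (s, a) of sum_t gamma^t * Pr(s_t = s, a_t = a)."""
    n_states, n_actions = len(mu0), len(pi[0])
    d = list(mu0)          # state distribution at the current time step
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                p_sa = d[s] * pi[s][a]       # Pr(s_t = s, a_t = a)
                total += discount * p_sa
                for s2 in range(n_states):
                    nxt[s2] += p_sa * P[s][a][s2]
        d = nxt
        discount *= gamma
    return total

def random_dist(n, rng):
    """A random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]

rng = random.Random(0)
S, A, gamma = 4, 3, 0.9
P = [[random_dist(S, rng) for _ in range(A)] for _ in range(S)]  # P[s][a][s']
pi = [random_dist(A, rng) for _ in range(S)]                      # pi[s][a]
mu0 = random_dist(S, rng)                                         # initial states
total = occupancy_sum(P, pi, mu0, gamma)
```

Changing the seed or the policy leaves `total` unchanged up to truncation error, which is the content of "for any policy $\pi$".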

Figures (2)

Figure 1: Diagrams of (a) online PbRL, (b) learning from offline preferences, and (c) the generalizability problem of using offline preferences.
Figure 2: Kendall's rank correlation coefficient between the inferred returns and the true returns of agents' trajectories during policy learning. This coefficient reflects how well the reward functions generalize to agents' behaviors. Agents were trained on the mixture datasets with $|\mathcal{Y}|=1225$. The reward function of PbRL does not generalize well for Ant and Pusher, which explains its poor performance presented in Table \ref{tab:results_mixture_all}. These observations support our claim about the generalizability issue.
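For readers unfamiliar with the metric in Figure 2, here is a minimal sketch of Kendall's rank correlation between inferred and true returns. It implements the tau-a variant in plain Python; which variant the paper uses is not stated in this excerpt (the two common variants, tau-a and tau-b, coincide when there are no ties). The function name and the example return values are hypothetical.

```python
def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1   # both sequences order the pair the same way
            elif sign < 0:
                discordant += 1   # the sequences disagree on the pair
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical returns for four trajectories.
true_returns = [1.0, 2.0, 3.0, 4.0]
inferred_returns = [1.0, 3.0, 2.0, 4.0]   # one swapped pair
tau = kendall_tau_a(inferred_returns, true_returns)
```

A value near 1 means the learned reward ranks the agent's trajectories in the same order as the true reward, and a value near 0 means the ranking carries little information.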

Theorems & Definitions (4)

Definition 3.1: 10.5555/528623
Corollary 3.2
Corollary 3.3
Theorem 3.4: 10.1145/1390156.1390286
