Online Policy Learning from Offline Preferences

Guoxi Zhang, Han Bao, Hisashi Kashima

TL;DR

This work introduces a framework for preference-based reinforcement learning (PbRL) that consolidates offline preferences with virtual preferences, which are comparisons between the agent's behaviors and the offline data. The virtual preferences let the reward function track the agent's behaviors, so it offers well-aligned guidance to the agent.

Abstract

In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and virtual preferences for PbRL, which are comparisons between the agent's behaviors and the offline data. Critically, the reward function can track the agent's behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.
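
To make "a reward function is learned from preferences" concrete, the following is a minimal sketch of the Bradley-Terry cross-entropy loss that is commonly used in PbRL. The abstract does not state which preference model the paper uses, so the model choice, the function names, and the toy reward function are all assumptions made for illustration. How virtual preferences are labeled is not described in this excerpt and is not modeled here.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def segment_return(reward_fn, segment):
    """Sum the predicted reward over a segment of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in segment)


def preference_loss(reward_fn, seg_a, seg_b, label):
    """Cross-entropy loss under an (assumed) Bradley-Terry model.

    label is 1.0 if seg_a is preferred and 0.0 if seg_b is preferred.
    """
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    # P(seg_a preferred) = sigmoid(diff), so log P = -softplus(-diff).
    log_p_a = -softplus(-diff)
    log_p_b = -softplus(diff)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def toy_reward(state, action):
    """Hypothetical reward function: reward equals the scalar state."""
    return state


seg_a = [(1.0, 0), (2.0, 0)]   # predicted return 3.0
seg_b = [(0.0, 0), (0.5, 0)]   # predicted return 0.5
loss_agree = preference_loss(toy_reward, seg_a, seg_b, 1.0)
loss_disagree = preference_loss(toy_reward, seg_a, seg_b, 0.0)
print(loss_agree, loss_disagree)
```

The loss is small when the label agrees with the ordering of predicted returns and large when it disagrees, which is the signal a reward model would be trained on.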

Paper Structure

This paper contains 36 sections, 3 theorems, 9 equations, 2 figures, 8 tables, and 1 algorithm.

Key Result

Corollary 3.2

$\sum_x \rho^\pi(x) = \frac{1}{1-\gamma}$ for any policy $\pi$.
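
The corollary can be checked numerically. The sketch below assumes that $x$ ranges over state-action pairs and that $\rho^\pi$ is the unnormalized discounted occupancy measure, $\rho^\pi(s,a) = \sum_{t \ge 0} \gamma^t \Pr(s_t = s, a_t = a)$; Definition 3.1 is not reproduced in this excerpt, so that reading is an assumption. The two-state MDP, the policy, and all names are hypothetical.

```python
def discounted_occupancy(gamma, init_dist, transition, policy, horizon=2000):
    """Approximate rho(s, a) = sum_t gamma^t Pr(s_t = s, a_t = a).

    The infinite sum is truncated at `horizon` steps; the neglected tail
    has total mass gamma**horizon / (1 - gamma).
    """
    n_states = len(init_dist)
    n_actions = len(policy[0])
    state_dist = list(init_dist)
    rho = [[0.0] * n_actions for _ in range(n_states)]
    discount = 1.0
    for _ in range(horizon):
        next_dist = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                p_sa = state_dist[s] * policy[s][a]
                rho[s][a] += discount * p_sa
                for s2 in range(n_states):
                    next_dist[s2] += p_sa * transition[s][a][s2]
        state_dist = next_dist
        discount *= gamma
    return rho


gamma = 0.9
init_dist = [1.0, 0.0]
# transition[s][a] is a distribution over next states.
transition = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.5, 0.5], [0.9, 0.1]],
]
# policy[s] is a distribution over actions.
policy = [[0.6, 0.4], [0.1, 0.9]]

rho = discounted_occupancy(gamma, init_dist, transition, policy)
total = sum(sum(row) for row in rho)
print(total, 1.0 / (1.0 - gamma))
```

The state-action probabilities sum to 1 at every time step, so the total mass is the geometric series $\sum_{t \ge 0} \gamma^t = \frac{1}{1-\gamma}$ regardless of the policy, which is what the printed values should reflect up to truncation error.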

Figures (2)

  • Figure 1: Diagrams of (a) online PbRL, (b) learning from offline preferences, and (c) the generalizability problem of using offline preferences.
  • Figure 2: Kendall's rank correlation coefficient between the inferred returns and the true returns of agents' trajectories during policy learning. This coefficient reflects how well the reward functions generalize to agents' behaviors. Agents were trained on the mixture datasets with $|\mathcal{Y}|=1225$. The reward function of PbRL does not generalize well for Ant and Pusher, which explains its poor performance presented in Table \ref{tab:results_mixture_all}. These observations support our claim about the generalizability issue.
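
As an illustration of the metric in Figure 2, the sketch below computes Kendall's rank correlation between true and inferred trajectory returns. It implements the tau-a variant, in which tied pairs count as neither concordant nor discordant; the caption does not say which variant the paper uses, and the return values are made up for the example.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical returns for four trajectories.
true_returns = [1.0, 2.0, 3.0, 4.0]
inferred_returns = [0.1, 0.4, 0.3, 0.9]
tau = kendall_tau(true_returns, inferred_returns)
print(tau)
```

Here five of the six trajectory pairs are ranked consistently and one is swapped, so tau is (5 - 1) / 6. A value near 1 means the learned reward ranks the agent's trajectories the way the true reward does; a value near 0 means the ranking is uninformative.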

Theorems & Definitions (4)

  • Definition 3.1 (citation: DOI 10.5555/528623)
  • Corollary 3.2
  • Corollary 3.3
  • Theorem 3.4 (citation: DOI 10.1145/1390156.1390286)