Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

Chen-Xiao Gao, Shengjun Fang, Chenjun Xiao, Yang Yu, Zongzhang Zhang

TL;DR

Hindsight Preference Learning (HPL) tackles offline preference-based RL by modeling human preferences as rewards conditioned on future trajectory outcomes, rather than relying on the Markovian sum of immediate rewards. A two-phase pipeline learns a reward function from a small labeled preference set and labels a large unlabeled offline dataset by marginalizing over future outcomes, where future information is encoded with a variational auto-encoder (VAE) to obtain embeddings $z_t$ for $\sigma_{t:t+k}$. The conditional reward $r_\psi(s_t,a_t|z_t)$ is trained via a Bradley–Terry-based preference model, and rewards for the unlabeled data are obtained by averaging over the prior distribution $f_\theta(z_t|s_t,a_t)$. Empirical results across Gym-MuJoCo, Adroit, and Meta-World tasks show that HPL yields more robust and advantageous rewards and improved policy performance, even under distribution shifts between labeled and unlabeled data. The work demonstrates the potential of exploiting large unlabeled datasets to improve credit assignment in offline PbRL, with code released for reproducibility.
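The two quantities named above can be written out explicitly. The block below is a sketch reconstructed from this summary, not a verbatim copy of the paper's equations: the segment superscripts $0/1$, the segment length $H$, the sample count $N$, and the name $\hat{r}$ for the marginalized reward are notational assumptions introduced here.

```latex
% Bradley-Terry preference model with hindsight-conditioned rewards
% (superscripts 0/1 index the two segments being compared; H is an assumed segment length)
P_\psi\left(\sigma^0 \succ \sigma^1\right)
  = \frac{\exp\left(\sum_{t=0}^{H-1} r_\psi\left(s_t^0, a_t^0 \mid z_t^0\right)\right)}
         {\exp\left(\sum_{t=0}^{H-1} r_\psi\left(s_t^0, a_t^0 \mid z_t^0\right)\right)
          + \exp\left(\sum_{t=0}^{H-1} r_\psi\left(s_t^1, a_t^1 \mid z_t^1\right)\right)}

% Reward used to label the unlabeled dataset: average over the learned prior,
% approximated here with N Monte Carlo samples z^{(i)} \sim f_\theta(\cdot \mid s_t, a_t)
\hat{r}(s_t, a_t)
  = \mathbb{E}_{z \sim f_\theta(\cdot \mid s_t, a_t)}\left[ r_\psi(s_t, a_t \mid z) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} r_\psi\left(s_t, a_t \mid z^{(i)}\right)
```

The first expression only ever sees $z_t$ computed from labeled segments, while the second replaces that observed future with samples from the prior, which is what lets the reward be evaluated on transitions that carry no preference label.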

Abstract

Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at https://github.com/typoverflow/WiseRL.
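As a rough illustration of the two steps described in the abstract, the sketch below computes a Bradley-Terry preference probability from per-step rewards and then estimates a marginalized reward by averaging over sampled future embeddings. It is a minimal toy under stated assumptions, not the authors' implementation: `toy_reward_fn` and `toy_prior_sampler` are hypothetical stand-ins for the learned conditional reward and the VAE prior, and the label convention (1.0 means the first segment is preferred) is chosen here for illustration.

```python
import math
import random


def bradley_terry_prob(rewards_0, rewards_1):
    """Probability that segment 0 is preferred, given per-step rewards of both segments."""
    diff = sum(rewards_1) - sum(rewards_0)
    # Numerically stable evaluation of 1 / (1 + exp(diff)).
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def preference_loss(rewards_0, rewards_1, label):
    """Cross-entropy loss; label = 1.0 means segment 0 is preferred (assumed convention)."""
    p0 = bradley_terry_prob(rewards_0, rewards_1)
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(p0 + eps) + (1.0 - label) * math.log(1.0 - p0 + eps))


def marginal_reward(reward_fn, prior_sampler, s, a, n_samples=20, rng=None):
    """Monte Carlo average of reward_fn(s, a, z) over z drawn from prior_sampler."""
    rng = rng or random.Random(0)
    zs = prior_sampler(s, a, n_samples, rng)
    return sum(reward_fn(s, a, z) for z in zs) / len(zs)


def toy_prior_sampler(s, a, n, rng, latent_dim=4):
    """Hypothetical prior: Gaussian embeddings whose mean depends on (s, a)."""
    mean = math.tanh(sum(s) + sum(a))
    return [[rng.gauss(mean, 1.0) for _ in range(latent_dim)] for _ in range(n)]


def toy_reward_fn(s, a, z):
    """Hypothetical conditional reward: linear in state, action, and embedding."""
    return sum(s) + sum(a) + 0.1 * sum(z)


if __name__ == "__main__":
    rng = random.Random(0)
    state, action = [0.2, -0.1], [0.5]
    print("P(seg0 preferred):", bradley_terry_prob([1.0, 0.5], [0.2, 0.1]))
    print("marginal reward:", marginal_reward(toy_reward_fn, toy_prior_sampler, state, action, 50, rng))
```

In a real pipeline the reward and prior would be neural networks trained on the labeled and unlabeled data respectively; plain functions are used here only to keep the averaging step visible.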

Paper Structure

This paper contains 31 sections, 11 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of the reward learning procedure in HPL. Unlike previous methods, HPL first generates embeddings $z_t$ that encode the future part of the segments, then optimizes a reward function $r_{\psi}$, conditioned on $s_t$, $a_t$, and the future embedding $z_t$, using the Bradley-Terry model.
  • Figure 2: A gambling MDP that illustrates the potential failure modes of the MR preference model.
  • Figure 3: The reward values given by the MR method and HPL. Each dot represents one trial, and its coordinates are the estimated reward values.
  • Figure 4: The performance curves of HPL and baseline methods in tasks with mismatched datasets. We report the average (solid line) and the standard deviation (shaded area) of each algorithm across 5 random seeds and 10 evaluation episodes for each seed.
  • Figure 5: Left: The rendered image of one raw trajectory selected from the offline dataset (top row) and the reconstruction by the VAE (bottom row). Right: The relationship between the log-probabilities of segments and their embeddings.
  • ...and 7 more figures