
Preference-Guided Reinforcement Learning for Efficient Exploration

Guojian Wang, Jianxiang Liu, Xinyuan Li, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen

TL;DR

The paper tackles efficient exploration in hard-exploration reinforcement learning without learning a reward model from human preferences. It introduces LOPE, an end-to-end preference-guided RL framework that optimizes the policy directly from trajectory preferences in two steps: a trust-region-based policy improvement step and a preference guidance step, the latter reformulated as a trajectory-wise state marginal matching (SMM) objective based on the maximum mean discrepancy (MMD). The key contributions are the trajectory-wise SMM objective, a policy-gradient formulation with intrinsic rewards derived from preferences, a formal per-iteration performance bound, and experiments showing faster convergence and better final performance on grid-world, continuous maze, and MuJoCo tasks. LOPE is also reported to be robust to label noise, to work across several kernels, and to approach near-optimal behavior without explicit reward learning, which makes PbRL more practical for long-horizon tasks.

Abstract

In this paper, we investigate preference-based reinforcement learning (PbRL), which enables reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grained reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tackle this issue, we introduce LOPE (Learning Online with trajectory Preference guidancE), an end-to-end preference-guided RL framework that enhances exploration efficiency in hard-exploration tasks. Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance, thereby avoiding the need to learn a separate reward model from preferences. Specifically, LOPE includes a two-step sequential policy optimization technique consisting of trust-region-based policy improvement and preference guidance steps. We reformulate preference guidance as a trajectory-wise state marginal matching problem that minimizes the maximum mean discrepancy distance between the preferred trajectories and the learned policy. Furthermore, we provide a theoretical analysis to characterize the performance improvement bound and evaluate the effectiveness of LOPE. When assessed in various challenging hard-exploration environments, LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance. The code used in this study is available at https://github.com/buaawgj/LOPE.
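As a rough illustration of the quantity behind the preference guidance step, the sketch below estimates the squared maximum mean discrepancy between two sets of states with a Gaussian kernel. It is a minimal sketch, not the authors' implementation: the function names, the kernel bandwidth, the synthetic 2-D states, and the use of a plain biased estimator are all assumptions made for illustration, and the paper's trajectory-wise formulation may differ.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two state vectors (bandwidth is an assumed value)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def sample_states(rng, center, n=100, scale=0.5):
    """Hypothetical 2-D states scattered around (center, center)."""
    return [(rng.gauss(center, scale), rng.gauss(center, scale)) for _ in range(n)]


rng = random.Random(0)
preferred_states = sample_states(rng, 2.0)   # stand-in for states on preferred trajectories
near_states = sample_states(rng, 2.0)        # policy that visits similar states
far_states = sample_states(rng, -2.0)        # policy that visits a different region

mmd_near = mmd_squared(preferred_states, near_states)
mmd_far = mmd_squared(preferred_states, far_states)
print(f"MMD^2 near: {mmd_near:.4f}, MMD^2 far: {mmd_far:.4f}")
```

Policy samples drawn near the preferred states give a much smaller estimate than samples drawn from a distant region, which is the kind of signal a matching objective can minimize.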

Paper Structure

This paper contains 31 sections, 7 theorems, 33 equations, 13 figures, 1 table, and 1 algorithm.

Key Result

Lemma 1

Let $r_g(s, a)$ denote the preference guidance-based reward, which is expressed in terms of the distance $\operatorname{dist}(\cdot, \cdot)$ defined in Eq. (eq:dist). Then Eq. (eq:perference_guidance) can be expanded into a form involving $d_\theta$, the discounted state visitation distribution defined in Section sec:Preliminaries (the displayed equations are omitted in this summary).
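To make the discounted state visitation distribution concrete, the sketch below estimates it from sampled trajectories in a small tabular setting. It assumes the common normalized convention $d(s) = (1 - \gamma) \sum_t \gamma^t P(s_t = s)$; the paper's Preliminaries section may use a different normalization, and the function name, state labels, and discount factor are hypothetical.

```python
from collections import defaultdict


def discounted_visitation(trajectories, gamma=0.99):
    """Monte Carlo estimate of d(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s).

    `trajectories` is a list of state sequences sampled from one policy.
    With finite trajectories of length T the estimate sums to 1 - gamma**T
    rather than 1, because the discounted tail beyond T is truncated.
    """
    weights = defaultdict(float)
    for states in trajectories:
        for t, state in enumerate(states):
            weights[state] += (1.0 - gamma) * gamma ** t
    n = len(trajectories)
    return {state: w / n for state, w in weights.items()}


# Two hypothetical length-3 rollouts over states labelled "s0", "s1", "s2".
rollouts = [["s0", "s1", "s2"], ["s0", "s0", "s2"]]
d_hat = discounted_visitation(rollouts, gamma=0.5)
print(d_hat)
```

Early states receive more weight than late ones, so two policies that reach the same states at different times can still have different visitation distributions.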

Figures (13)

  • Figure 1: Illustration of our method. Because exploration is hard and rewards are sparse, conventional policy improvement is inefficient. In this work, we propose an end-to-end preference-guided RL framework to achieve efficient exploration.
  • Figure 2: The advantage of the guiding policy $\pi_b$ over the current policy.
  • Figure 3: The four environments for evaluating LOPE's performance: (a) Grid World; (b) AntMaze-Umaze; (c) PointMaze-Medium; (d) PointMaze-Large.
  • Figure 4: The two locomotion control tasks: (a) SparseHalfCheetah; (b) SparseHopper.
  • Figure 5: (a) Learning curves of success rate in the grid-world maze with a fixed goal; (b) Learning curves of success rate in the grid-world maze with random goals.
  • ...and 8 more figures

Theorems & Definitions (14)

  • Lemma 1: Preference guidance
  • Remark 1
  • Proposition 1: Performance improvement bound of the policy improvement (PI) step [achiam2017constrained]
  • Lemma 2: Performance improvement bound of the preference guidance (PG) step
  • Remark 2
  • Theorem 1: Performance improvement bound of LOPE
  • Remark 3: Comparison with Prior Preference-Based Bounds
  • Proof
  • Definition 1: $\alpha$-coupled policies [schulman2015trust, kang2018policy]
  • Lemma 3: Adopted from [levin2017markov]
  • ...and 4 more