Query-Policy Misalignment in Preference-Based Reinforcement Learning

Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

TL;DR

The paper identifies Query-Policy Misalignment as a key bottleneck in preference-based reinforcement learning, where informative queries may not align with the current policy's visitation distribution and thus provide limited policy guidance. It introduces Query-Policy Alignment (QPA), blending policy-aligned query selection from a near on-policy buffer with a hybrid experience replay to ensure reward learning and value updates stay focused on the current policy region. Empirical evaluation on DMControl and MetaWorld shows substantial gains in both human feedback efficiency and RL sample efficiency, often outperforming state-of-the-art PbRL methods. The approach is lightweight to implement and can be added to existing PbRL systems with minimal code changes, highlighting the practical importance of aligning query strategies with policy learning in PbRL applications.

Abstract

Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model; counter-intuitively, we find that this does not necessarily lead to improved performance. To explain this, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not align with the RL agent's interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy queries and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. Comprehensive experiments show that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
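
The two ingredients named in the abstract, near on-policy query selection and hybrid experience replay, can be illustrated with a toy buffer. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class name QueryPolicyAlignedBuffer, its method names, the window size recent_trajectories, and the mixing ratio recent_frac are all hypothetical and do not come from the paper. The idea shown is that preference queries are drawn only from the most recent trajectories, while value-update batches mix recent transitions with uniform samples from the full buffer.

```python
import random
from collections import deque


class QueryPolicyAlignedBuffer:
    """Toy buffer: full history plus a window of recent (near on-policy) trajectories.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, recent_trajectories=10, seed=0):
        self.trajectories = []                            # entire replay buffer
        self.recent = deque(maxlen=recent_trajectories)   # policy-aligned window
        self.rng = random.Random(seed)

    def add_trajectory(self, transitions):
        traj = list(transitions)
        self.trajectories.append(traj)
        self.recent.append(traj)  # the oldest trajectory drops out automatically

    def _sample_segment(self, pool, seg_len):
        # Pick one trajectory that is long enough, then a contiguous slice of it.
        traj = self.rng.choice([t for t in pool if len(t) >= seg_len])
        start = self.rng.randrange(len(traj) - seg_len + 1)
        return traj[start:start + seg_len]

    def sample_query_pairs(self, n_pairs, seg_len):
        """Segment pairs for preference labeling, drawn only from recent trajectories."""
        pool = list(self.recent)
        return [
            (self._sample_segment(pool, seg_len), self._sample_segment(pool, seg_len))
            for _ in range(n_pairs)
        ]

    def sample_hybrid_batch(self, batch_size, recent_frac=0.5):
        """Transitions for value updates: part recent window, part entire buffer."""
        n_recent = int(round(batch_size * recent_frac))
        recent_pool = [tr for t in self.recent for tr in t]
        full_pool = [tr for t in self.trajectories for tr in t]
        batch = [self.rng.choice(recent_pool) for _ in range(n_recent)]
        batch += [self.rng.choice(full_pool) for _ in range(batch_size - n_recent)]
        return batch
```

In an existing PbRL loop, a buffer like this would replace the uniform query sampler and the uniform replay sampler, which is consistent with the abstract's remark that only a few lines of code need to change.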

Paper Structure (28 sections, 5 equations, 20 figures, 5 tables, 1 algorithm)

Figures (20)

  • Figure 1: Illustration of query-policy misalignment. Bob's current focus is on grasping the blocks. However, the overseer advises him not to cause harm to humans instead of providing guidance on grasping techniques.
  • Figure 2: Impacts of query-policy misalignment in PbRL training. (a) 2D navigation task, in which the RL agent should navigate to the goal. (b) The desired behavior for this task is to move to the goal in a straight line. (c) Learning curves of different query selection methods. (d) Existing query selection methods often select queries that lie outside the visitation distribution of the current policy.
  • Figure 3: Query-policy misalignment. Existing query selection methods often select queries that lie outside the visitation distribution of the current policy.
  • Figure 4: Learning curves of QPA and QPA (20% or 50% query explore) on locomotion tasks. In QPA, 100% of the queries are sampled from the policy-aligned buffer $\mathcal{D}^{\text{pa}}$. In QPA (20% or 50% query explore), we sample 20% or 50% queries from the entire replay buffer $\mathcal{D}$, while the remaining 80% or 50% of queries are sampled from $\mathcal{D}^{\text{pa}}$. The declining performance of QPA (20% or 50% query explore) suggests that allocating some queries to the entire replay buffer in an attempt to explore high-reward state regions can compromise feedback efficiency.
  • Figure 5: Learning curves on locomotion tasks as measured on the ground truth reward. The dashed black line represents the last feedback collection step.
  • ...and 15 more figures
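
The query split described in the Figure 4 caption can be sketched as follows. This is a hedged illustration only: the function names split_query_budget and sample_query_sources and the parameter explore_frac are hypothetical, and the buffers are plain Python lists standing in for $\mathcal{D}$ and $\mathcal{D}^{\text{pa}}$. Setting explore_frac to 0.0 corresponds to plain QPA in the caption, while 0.2 and 0.5 correspond to the two "query explore" variants.

```python
import random


def split_query_budget(n_queries, explore_frac):
    """Return (queries from the entire buffer, queries from the policy-aligned buffer)."""
    n_explore = int(round(n_queries * explore_frac))
    return n_explore, n_queries - n_explore


def sample_query_sources(full_buffer, aligned_buffer, n_queries, explore_frac, rng=None):
    """Draw query candidates from each source according to the split above."""
    if rng is None:
        rng = random.Random(0)
    n_explore, n_aligned = split_query_budget(n_queries, explore_frac)
    explore = [rng.choice(full_buffer) for _ in range(n_explore)]
    aligned = [rng.choice(aligned_buffer) for _ in range(n_aligned)]
    return explore, aligned
```

With a budget of 10 queries, explore_frac=0.2 gives 2 queries from the entire buffer and 8 from the policy-aligned buffer, matching the 20%/80% split in the caption.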

Theorems & Definitions (1)

  • proof