S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning

Ni Mu; Yao Luan; Yiqin Yang; Bo Xu; Qing-shan Jia

TL;DR

This work targets the practical bottleneck of segment indistinguishability in preference-based reinforcement learning (PbRL). It introduces S-EPOA, which combines skill-based unsupervised pretraining with a novel skill-aware query selection mechanism to generate more distinguishable human queries and improve reward learning. The approach is supported by theoretical considerations of query disagreement and extensive experiments on DMControl and Metaworld showing improved robustness and learning efficiency under non-ideal human feedback, along with ablations that validate the contribution of each component. The framework broadens the applicability of PbRL by mitigating labeling errors and enhancing sample efficiency in complex robotic and locomotion tasks.

Abstract

Preference-based reinforcement learning (PbRL) stands out by utilizing human preferences as a direct reward signal, eliminating the need for intricate reward engineering. However, despite its potential, traditional PbRL methods are often constrained by the indistinguishability of segments, which impedes the learning process. In this paper, we introduce the Skill-Enhanced Preference Optimization Algorithm (S-EPOA), which addresses the segment indistinguishability issue by integrating skill mechanisms into the preference learning framework. Specifically, we first conduct unsupervised pretraining to learn useful skills. Then, we propose a novel query selection mechanism that balances information gain and distinguishability over the learned skill space. Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in terms of both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability.

Paper Structure

This paper contains 44 sections, 2 theorems, 21 equations, 9 figures, 9 tables, 3 algorithms.

Introduction
Preliminaries
Reinforcement Learning.
Preference-based Reinforcement Learning.
Indistinguishability of Segments
Human experiments.
Skill-Driven PbRL
Skill-based Unsupervised Pretraining
Skill-based Query Selection
Implementation Details
Experiments
Setup
Domains.
Baselines and Implementation.
Noisy scripted teacher imitating humans.
...and 29 more sections

Key Result

Proposition 1

Let $\{\hat{r}^i\}$ be an ensemble of i.i.d. reward estimators, and let $(\sigma_1,\sigma_2)$ be a segment pair with ground-truth cumulative discounted rewards $r_1\ge r_2$. Suppose $\hat{r}^i$ estimates the cumulative discounted reward of $\sigma_j$ as $\hat{r}^i_j\sim N(r_j, c)$ (where $c$ is a constant), an… Then the disagreement of the induced preference across $\{\hat{r}^i\}$, i.e. $\mathrm{Var}[\hat{P}[\sig…
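
The setup of Proposition 1 can be explored numerically. The following is a minimal Monte Carlo sketch, not the authors' code. It assumes several things the excerpt above does not state: that $c$ is the variance of each estimate, that the two estimates $\hat{r}^i_1$ and $\hat{r}^i_2$ are drawn independently, and that each ensemble member induces a preference through a Bradley-Terry (logistic) model. The function name `preference_disagreement` and all parameter values are hypothetical.

```python
import numpy as np


def preference_disagreement(r1, r2, c=1.0, n_members=10_000, seed=0):
    """Variance of the induced preference P_hat[sigma_1 > sigma_2] across an ensemble.

    Assumptions (not confirmed by the source): c is a variance, the two
    estimates are independent, and preferences follow a Bradley-Terry model.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(c)
    # Each ensemble member draws its own noisy estimate of both segment returns.
    r1_hat = rng.normal(r1, std, size=n_members)
    r2_hat = rng.normal(r2, std, size=n_members)
    # Bradley-Terry preference: sigmoid of the estimated return difference.
    p_hat = 1.0 / (1.0 + np.exp(-(r1_hat - r2_hat)))
    # Disagreement is measured as the variance of the preference across members.
    return float(np.var(p_hat))


# Sweep the ground-truth return gap r1 - r2 while holding r2 = 0.
gaps = [0.0, 1.0, 2.0, 5.0]
disagreements = [preference_disagreement(g, 0.0) for g in gaps]
for gap, dis in zip(gaps, disagreements):
    print(f"gap={gap:.1f}  disagreement={dis:.4f}")
```

Under these assumptions, the simulated disagreement is largest when the two segments have nearly equal returns, which is also the regime where Figure 2 reports the most human labeling errors.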

Figures (9)

Figure 1: The framework of S-EPOA. In the pretraining phase, we learn diverse skills based on unsupervised skill discovery methods. In the online training phase, we leverage a novel skill-based query selection method to generate distinguishable queries for non-ideal teachers.
Figure 2: Human-labeled preferences match ratio with ground truth. As the return differences decrease, labeling errors increase.
Figure 3: Learning curves on locomotion tasks from DMControl, where each row corresponds to a different error rate $\epsilon$, and each column represents a specific task. SAC serves as an oracle, using the ground-truth reward unavailable in PbRL settings. The solid line and shaded regions respectively represent the mean and standard deviation of episode return, across $5$ seeds.
Figure 4: Learning curves on robotic manipulation tasks from Metaworld, with error rate $\epsilon=0.2$, across $5$ seeds.
Figure 5: Ablation studies on the task. (a) Contribution of each technique in S-EPOA, under $\epsilon=0.3$. (b) Demonstration of enhanced learning efficiency of S-EPOA under the ideal scripted teacher with error rate $\epsilon=0$. (c) The learning curve of S-EPOA and baselines, with and without data augmentation under $\epsilon=0.3$. (d) Integrating S-EPOA with other skill discovery methods, under $\epsilon=0.3$.
...and 4 more figures

Theorems & Definitions (3)

Proposition 1
Proposition 1
proof
