RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences

Jie Cheng; Gang Xiong; Xingyuan Dai; Qinghai Miao; Yisheng Lv; Fei-Yue Wang

TL;DR

This paper presents RIME, a robust PbRL algorithm for effective reward learning from noisy preferences, which utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training.

Abstract

Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.

Paper Structure (20 sections, 3 theorems, 16 equations, 8 figures, 18 tables, 1 algorithm)

Introduction
Related work
Preliminaries
RIME
Denoising Discriminator
Warm Start
Experiments
Setups
Results
Ablation Study
Conclusion
RIME Algorithm Details
Effects of biased reward model
Proof for Theorem 4.1
Experimental Details
...and 5 more sections

Key Result

Theorem 4.1

Consider a preference dataset $\{(\sigma^0_i,\sigma^1_i,\tilde{y}_i)\}_{i=1}^n$, where $\tilde{y}_i$ is the annotated label for the segment pair $(\sigma^0_i,\sigma^1_i)$ with the ground truth label $y_i$. Let $x_i$ denote the tuple $(\sigma^0_i, \sigma^1_i)$. Assume the cross-entropy loss ...
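
The theorem is stated in terms of a cross-entropy loss, while the discriminator in Figure 1 thresholds a KL divergence. For a hard (one-hot) label the two quantities coincide, which the short derivation below makes explicit. This is a standard identity, not a reproduction of the paper's statement. The notation $\tilde{y}_i(k)$ and $P_\psi(k \mid x_i)$ for the label and predicted probability of segment $k$, the symbol $\mathcal{L}^{\mathrm{CE}}$, and the direction of the divergence are all assumptions made for illustration.

```latex
% Assumed notation: \tilde{y}_i(k) = 1 if segment k is the annotated choice, 0 otherwise.
% Convention: 0 \log 0 = 0, so the entropy term of a one-hot label vanishes.
\[
\begin{aligned}
D_{\mathrm{KL}}\big(\tilde{y}_i \,\|\, P_\psi(\cdot \mid x_i)\big)
  &= \sum_{k \in \{0,1\}} \tilde{y}_i(k) \log \frac{\tilde{y}_i(k)}{P_\psi(k \mid x_i)} \\
  &= -\sum_{k \in \{0,1\}} \tilde{y}_i(k) \log P_\psi(k \mid x_i) \\
  &= \mathcal{L}^{\mathrm{CE}}(x_i, \tilde{y}_i).
\end{aligned}
\]
```

Under these assumptions, a bound on the cross-entropy loss of a sample is equivalently a bound on its KL divergence from the annotated label.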

Figures (8)

Figure 1: Overview of RIME. In the pre-training phase, we warm start the reward model $\hat{r}_\psi$ with intrinsic rewards $r^\text{int}$ to facilitate a smooth transition to the online training phase. Post pre-training, the policy, Q-network, and reward model $\hat{r}_\psi$ are all inherited as initial configurations for online training. During online training, we utilize a denoising discriminator to screen denoised preferences for robust reward learning. This discriminator employs a dynamic lower bound $\tau_\text{lower}$ on the KL divergence between predicted preferences $P_\psi$ and annotated preference labels $\tilde{y}$ to filter trustworthy samples $\mathcal{D}_t$, and an upper bound $\tau_\text{upper}$ to flip highly unreliable labels $\mathcal{D}_f$.
Figure 2: Performance degradation during transition on Walker-walk (left) and Quadruped-walk (right) with 30% noisy preferences. We pre-train an agent using SAC for 20k steps. The warm start method shows a smaller transition gap and faster recovery.
Figure 3: Learning curves for robotic manipulation tasks from Meta-world, where each row represents a specific task and each column corresponds to a different error rate $\epsilon$. SAC serves as a performance upper bound, using a ground-truth reward function unavailable in PbRL settings. The total amount of feedback and the amount per session are listed in Table \ref{table:hyperparameters_condition}. Solid lines and shaded regions denote the mean and standard deviation of the success rate across ten runs.
Figure 4: Learning curves on locomotion tasks from DMControl, where each row represents a specific task and each column corresponds to a different error rate $\epsilon$. SAC serves as a performance upper bound, using a ground-truth reward function unavailable in PbRL settings. The total amount of feedback and the amount per session are listed in Table \ref{table:hyperparameters_condition}. Solid lines and shaded regions denote the mean and standard deviation of the episode return across ten runs.
Figure 5: Ablation study on real non-expert human teachers
...and 3 more figures
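
The screening step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tuple layout of a sample, the label convention (the index of the preferred segment), and the threshold values are all hypothetical. It also assumes that samples with KL divergence below $\tau_\text{lower}$ are the trusted ones and that those above $\tau_\text{upper}$ are flipped, which is one reading of the caption. The dynamic schedule for $\tau_\text{lower}$ is not modeled here.

```python
import math


def kl_to_label(p_seg0_preferred, label):
    """KL(one-hot label || prediction) for a binary preference.

    For a hard label this reduces to -log of the probability the model
    assigns to the annotated choice. `label` is 0 or 1, the index of the
    segment the annotator preferred (an assumed convention).
    """
    p = p_seg0_preferred if label == 0 else 1.0 - p_seg0_preferred
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)


def screen_preferences(samples, tau_lower, tau_upper):
    """Split samples into a trusted set and a flipped set.

    `samples` is an iterable of (pair_id, p_seg0_preferred, label).
    Samples whose KL falls between the two thresholds are dropped.
    """
    trusted, flipped = [], []
    for pair_id, p0, label in samples:
        kl = kl_to_label(p0, label)
        if kl < tau_lower:
            trusted.append((pair_id, label))      # keep label as annotated
        elif kl > tau_upper:
            flipped.append((pair_id, 1 - label))  # treat as corrupted, flip
    return trusted, flipped


# Hypothetical data: model predictions and annotated labels for three pairs.
samples = [("a", 0.9, 0), ("b", 0.5, 1), ("c", 0.001, 0)]
trusted, flipped = screen_preferences(samples, tau_lower=0.5, tau_upper=3.0)
print(trusted, flipped)
```

In this toy run, pair "a" agrees with the model and is trusted, pair "c" strongly disagrees and has its label flipped, and pair "b" falls between the thresholds and is used for neither set.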

Theorems & Definitions (5)

Theorem 4.1: KL Divergence Lower Bound for Corrupted Samples
Theorem 2.2: Upper bound of Q-function error
Proof
Theorem 4.1
Proof
