Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback

Yifu Yuan; Jianye Hao; Yi Ma; Zibin Dong; Hebin Liang; Jinyi Liu; Zhixin Feng; Kai Zhao; Yan Zheng

TL;DR

Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments, and establishes a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks.

Abstract

Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow starting from real human feedback, fostering progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Extensive experiments show that results obtained with the collected datasets are competitive with those obtained from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We aim to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.
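
As an illustration of how comparative feedback can stand in for a hand-designed reward, the sketch below fits a reward model to pairwise segment preferences with the Bradley-Terry cross-entropy loss. This is a generic preference-learning technique and is not claimed to be the implementation used in the Uni-RLHF baselines. The linear reward model, the feature shapes, the label convention, and the function names (`preference_loss`, `preference_grad`, `fit_reward`) are all assumptions made for illustration.

```python
import numpy as np


def preference_loss(w, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy for one annotated comparison.

    w      : (d,) weights of a hypothetical linear reward model r(x) = w . x
    seg_a  : (T, d) per-step features of segment A
    seg_b  : (T, d) per-step features of segment B
    label  : 1.0 if the annotator preferred A, 0.0 if B, 0.5 if judged equal
    """
    # Difference of predicted segment returns; P(A preferred) = sigmoid(logit).
    logit = float((seg_a.sum(axis=0) - seg_b.sum(axis=0)) @ w)
    # Numerically stable log-sigmoid terms.
    log_p_a = -np.logaddexp(0.0, -logit)
    log_p_b = -np.logaddexp(0.0, logit)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def preference_grad(w, seg_a, seg_b, label):
    """Gradient of preference_loss with respect to w."""
    diff = seg_a.sum(axis=0) - seg_b.sum(axis=0)
    p_a = 0.5 * (1.0 + np.tanh(0.5 * float(diff @ w)))  # stable sigmoid
    return (p_a - label) * diff


def fit_reward(pairs, dim, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean preference loss.

    pairs : list of (seg_a, seg_b, label) tuples
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = np.zeros(dim)
        for seg_a, seg_b, label in pairs:
            grad += preference_grad(w, seg_a, seg_b, label)
        w -= lr * grad / len(pairs)
    return w
```

The learned reward can then label an otherwise reward-free offline dataset before running a standard offline RL algorithm. A label of 0.5 treats "equally preferred" as a soft target, which is one common convention rather than a requirement.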

Paper Structure

This paper contains 38 sections, 12 equations, 15 figures, and 13 tables.

Introduction
Related Work
Universal Platform for Reinforcement Learning with Diverse Feedback Types
Implementation for Multi-feedback Annotation Platform
Standardized Feedback Encoding Format for Reinforcement Learning
Large-scale Crowdsourced Annotation Pipeline
Evaluating Benchmarks for offline RLHF
Evaluating Offline RL with Comparative Feedback
D4RL Experiments
Atari Experiments
SMARTS Experiments
Evaluating Offline RL with Attribute Feedback
Conclusion, Challenge, and Future Directions
Environment and Datasets Details
The Details of the Datasets for D4RL
...and 23 more sections

Figures (15)

Figure 1: Overview of the Uni-RLHF system. Uni-RLHF consists of three components including the platform, the datasets, and the offline RLHF baselines.
Figure 2: Annotation accuracy in left-c task
Figure 3: Learning curves of three sub-tasks with different objectives. The target attribute strengths are set to $[1.0, 0.5, 0.5, 0.5, 1.0]$, $[0.5, 1.0, 0.5, 0.5, 1.0]$ and $[0.5, 0.5, 0.5, 1.0, 1.0]$, respectively. The policy can continuously optimize multiple objectives during the training process.
Figure 4: We visualized the behavior switching by adjusting the target attributes every 200 steps. The attribute values for speed were set to $[0.1, 1.0, 0.5, 0.1, 1.0]$, and for height, they were set to $[1.0, 0.6, 1.0, 0.1, 1.0]$. The corresponding changes in attributes can be clearly observed in the curves. We refer to our homepage for the full visualisations of walker.
Figure 5: Visualization of an instance of the left-turn scene in the SMARTS environment. The first row shows the driving trajectory of the ego vehicle controlled by the Oracle model. The second row displays the trajectory of the ego vehicle under the control of the CS model. The CS model opts to stop and wait, while the Oracle model results in a collision.
...and 10 more figures
