Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

Calarina Muslimani, Matthew E. Taylor

TL;DR

The paper addresses the high feedback cost of reward learning in human-in-the-loop RL by introducing Sub-optimal Data Pre-training (SDP). SDP pseudo-labels low-quality trajectories with the minimum reward $r_{\text{min}}$, pre-trains the reward model on them, and then seeds the agent's replay buffer with the same data to bias early exploration. Combined with both preference- and scalar-based RL algorithms, SDP improves feedback efficiency and performs strongly across simulated DMControl and Meta-World tasks; a 16-person human study also shows SDP outperforming PEBBLE. The two-phase design (reward-model pre-training, then agent updates with sub-optimal data) acts as a prior that reduces the amount of human feedback needed while maintaining or improving learning outcomes. The work is relevant to deploying human-in-the-loop RL in real-world robotics and other complex tasks, where reward design is challenging and human labeling is expensive.
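
The replay-buffer seeding step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReplayBuffer` class, the `seed_buffer` helper, the toy 1-D transitions, and the choice $r_{\text{min}} = 0$ are hypothetical and are not taken from the authors' code.

```python
import random
from collections import deque

R_MIN = 0.0  # assumed value of the minimum reward, chosen for illustration


class ReplayBuffer:
    """Minimal FIFO replay buffer (hypothetical; not the authors' implementation)."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def seed_buffer(buffer, suboptimal_transitions, r_min=R_MIN):
    """Insert reward-free (s, a, s') transitions, pseudo-labeled with r_min."""
    for state, action, next_state in suboptimal_transitions:
        buffer.add(state, action, r_min, next_state)
    return buffer


# Toy reward-free data standing in for real sub-optimal trajectories.
data_rng = random.Random(1)
suboptimal = [
    ((float(t),), (data_rng.uniform(-1, 1),), (float(t + 1),))
    for t in range(100)
]

buffer = seed_buffer(ReplayBuffer(capacity=1000), suboptimal)
batch = buffer.sample(8)  # before any agent experience, every sampled reward is r_min
```

Until the agent adds its own experience, every batch drawn from the seeded buffer carries the pseudo-label, which is one way to read the claim that seeding biases early exploration away from the sub-optimal behavior.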

Abstract

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many of the human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase provides the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, but often significantly improve, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
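
The pseudo-labeling and pre-training steps described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the linear `LinearRewardModel`, the squared-error objective, the learning rate, the random toy transitions, and $r_{\text{min}} = 0$ are all hypothetical stand-ins (a practical reward model would more likely be a neural network).

```python
import random

R_MIN = 0.0  # assumed minimum environment reward, chosen for illustration


def pseudo_label(transitions, r_min=R_MIN):
    """Attach r_min as the reward label to every (state, action) pair."""
    return [(s, a, r_min) for (s, a) in transitions]


class LinearRewardModel:
    """Hypothetical stand-in reward model: r_hat(s, a) = w . [s, a] + b."""

    def __init__(self, input_dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
        self.b = rng.uniform(-0.5, 0.5)

    def predict(self, s, a):
        x = list(s) + list(a)
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def sgd_step(self, s, a, target, lr):
        x = list(s) + list(a)
        err = self.predict(s, a) - target  # gradient of 0.5 * err^2 w.r.t. the prediction
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err


def pretrain(model, labeled, epochs=200, lr=0.05):
    """Regress the model's predictions toward the pseudo-labels."""
    for _ in range(epochs):
        for s, a, r in labeled:
            model.sgd_step(s, a, r, lr)
    return model


def mean_squared_error(model, labeled):
    return sum((model.predict(s, a) - r) ** 2 for s, a, r in labeled) / len(labeled)


# Toy reward-free data: 3-D states and 1-D actions drawn at random.
rng = random.Random(42)
transitions = [
    ([rng.uniform(-1, 1) for _ in range(3)], [rng.uniform(-1, 1)])
    for _ in range(50)
]

labeled = pseudo_label(transitions)
model = LinearRewardModel(input_dim=4)
before = mean_squared_error(model, labeled)
pretrain(model, labeled)
after = mean_squared_error(model, labeled)  # error on the pseudo-labels shrinks
```

No human labels or preferences are used at any point in this phase; the only supervision is the constant pseudo-label, so the model simply learns to assign low reward to the sub-optimal transitions.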

Paper Structure

This paper contains 48 sections, 4 equations, 16 figures, 14 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Overview of SDP: After obtaining a data set of sub-optimal trajectories, we pseudo-label the transitions with rewards of $r_{\text{min}}$ (e.g., $r_{\text{min}}=0$). We then pre-train the reward model $\hat{r}_{\theta}$ using this data set. During the agent update phase, we initialize the RL agent's replay buffer with the same pseudo-labeled data set. The agent then interacts in the environment and makes learning updates to obtain new behaviors for a teacher to give feedback.
  • Figure 2: Results from the preference feedback experiments in the DMControl and Meta-World suites show mean AUC $\pm$ 95% confidence intervals. * indicates that SDP + the base preference learning algorithm achieves a statistically greater score than the base preference learning algorithm.
  • Figure 3: In the scalar feedback experiments on the DMControl environments, SDP significantly outperforms R-PEBBLE and Deep TAMER and achieves comparable performance to SAC.
  • Figure 4: This highlights that SDP can leverage sub-optimal data from different prior tasks, as it performs comparably to SDP with sub-optimal data from the target task.
  • Figure 5: This demonstrates that SDP can significantly outperform PEBBLE in terms of both AUC (left) and final performance (right) even when human teachers are providing preferences. * denotes a statistically significant difference between SDP and PEBBLE.
  • ...and 11 more figures
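
The "mean AUC $\pm$ 95% confidence intervals" metric reported in the Figure 2 caption can be computed along the following lines. This is a sketch under assumptions: the trapezoidal rule for the area under the learning curve and the normal-approximation interval ($z = 1.96$) are common choices and are not confirmed as the authors' exact procedure, and the three learning curves are made-up numbers, not results from the paper.

```python
import math
import statistics


def learning_curve_auc(steps, returns):
    """Trapezoidal area under a return-vs-step learning curve."""
    return sum(
        0.5 * (returns[i] + returns[i + 1]) * (steps[i + 1] - steps[i])
        for i in range(len(steps) - 1)
    )


def mean_and_ci95(values):
    """Mean and half-width of a normal-approximation 95% CI (z = 1.96)."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * sem


# Made-up learning curves for three seeds (illustrative only).
steps = [0, 1, 2, 3]
curves = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 2.0, 4.0, 6.0],
    [0.0, 3.0, 6.0, 9.0],
]

aucs = [learning_curve_auc(steps, c) for c in curves]  # [4.5, 9.0, 13.5]
mean_auc, half_width = mean_and_ci95(aucs)
print(f"AUC = {mean_auc:.2f} +/- {half_width:.2f}")
```

AUC rewards methods that learn quickly as well as those that finish well, which is why it is a natural summary when the quantity of interest is feedback efficiency rather than final return alone.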