Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
Calarina Muslimani, Matthew E. Taylor
TL;DR
The paper addresses the costly feedback requirements of reward learning in human-in-the-loop RL by introducing Sub-optimal Data Pre-training (SDP). SDP pseudo-labels low-quality, reward-free trajectories with the minimum environment reward $r_{\text{min}}$, pre-trains the reward model on them, and seeds the agent's replay buffer with the same data to bias early exploration. Combined with both preference- and scalar-based RL algorithms, SDP improves feedback efficiency and performs strongly on simulated DMControl and Meta-World tasks; a 16-person human study also shows advantages over baselines such as PEBBLE. The two-phase design (reward-model pre-training, then agent updates with sub-optimal data) acts as a prior that reduces the amount of human feedback needed while maintaining or improving learning outcomes. This matters for deploying human-in-the-loop RL in real-world robotics and other complex tasks, where reward design is hard and human labeling is expensive.
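The pseudo-labeling and replay-buffer seeding steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Transition` layout, the helper names `pseudo_label` and `seed_replay_buffer`, the toy data, and the value `r_min=0.0` are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, NamedTuple, Sequence, Tuple

Vector = Tuple[float, ...]


class Transition(NamedTuple):
    """One (s, a, r, s') tuple; this layout is a hypothetical choice."""
    state: Vector
    action: Vector
    reward: float
    next_state: Vector


def pseudo_label(
    reward_free: Sequence[Tuple[Vector, Vector, Vector]], r_min: float
) -> List[Transition]:
    """Attach the minimum reward r_min to every reward-free (s, a, s') triple."""
    return [Transition(s, a, r_min, s_next) for s, a, s_next in reward_free]


def seed_replay_buffer(
    buffer: Deque[Transition], labeled: Sequence[Transition]
) -> Deque[Transition]:
    """Place the pseudo-labeled transitions in the buffer before agent training."""
    buffer.extend(labeled)
    return buffer


# Toy sub-optimal data: (state, action, next_state) triples with no rewards.
sub_optimal = [
    ((0.0, 0.0), (1.0,), (0.1, 0.0)),
    ((0.1, 0.0), (-1.0,), (0.0, 0.0)),
]

# r_min = 0.0 is an assumed value; in practice it depends on the environment.
labeled = pseudo_label(sub_optimal, r_min=0.0)
replay_buffer = seed_replay_buffer(deque(maxlen=1000), labeled)
```

No human labels or preferences are consulted at any point here: every reward in the seeded buffer comes from the constant `r_min`.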
Abstract
To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., to require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase gives the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, but often significantly improve, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
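The reward-model pre-training phase can be sketched as plain regression onto the pseudo-labels. The example below is a simplified illustration under stated assumptions, not the paper's method: it uses a linear reward model, a squared-error loss, and hand-picked toy features, and the function names, learning rate, epoch count, and `r_min = 0.0` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Linear stand-in for a reward model: r_hat(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mse_to_r_min(weights, bias, inputs, r_min: float) -> float:
    """Mean squared gap between predicted rewards and the pseudo-label r_min."""
    return sum((predict(weights, bias, x) - r_min) ** 2 for x in inputs) / len(inputs)


def pretrain_reward_model(
    inputs: Sequence[Sequence[float]],
    r_min: float,
    weights: Sequence[float],
    bias: float,
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit the model to output r_min on every sub-optimal input (per-sample SGD)."""
    weights = list(weights)
    for _ in range(epochs):
        for x in inputs:
            error = predict(weights, bias, x) - r_min
            # Gradient step on 0.5 * error**2 with respect to weights and bias.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy features for sub-optimal transitions (e.g. state and action concatenated).
inputs = [(0.0, 0.0, 1.0), (0.1, 0.0, -1.0), (0.2, 0.1, 0.5)]
r_min = 0.0  # assumed minimum environment reward
init_w, init_b = [0.5, 0.5, 0.5], 0.5

mse_before = mse_to_r_min(init_w, init_b, inputs, r_min)
w, b = pretrain_reward_model(inputs, r_min, init_w, init_b)
mse_after = mse_to_r_min(w, b, inputs, r_min)
```

After pre-training, the model's predictions on the low-quality inputs sit closer to `r_min` than they did at initialization, which is the "head start" the abstract describes. Human feedback would then refine the model from this starting point.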
