Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
Maël Macuglia, Paul Friedrich, Giorgia Ramponi
TL;DR
This work tackles two core hurdles of reward-specified RL, reward mis-specification and unsafe exploratory trials, by proposing a reward-free, offline-to-online framework that first learns a safe initial policy from expert demonstrations and then refines it online using human trajectory preferences. The BRIDGE algorithm unifies offline imitation with online preference-based reinforcement learning (PbRL): it constructs a Hellinger-ball confidence set from the offline data, constrains online exploration to this safe region, and updates a GLM-style preference model with both offline and online data. The authors derive regret bounds showing that offline data reduces online sample complexity, with BRIDGE achieving regret of order √T that diminishes as the offline dataset grows. They validate the approach on both discrete and continuous control tasks, where it outperforms standalone behavioral cloning (BC) and online PbRL baselines. The findings offer a principled, data-efficient path for training interactive agents that learn from demonstrations and minimal human feedback without requiring an explicit reward specification. This could significantly enhance the safety and practicality of deploying RL-based systems in robotics, industry, and healthcare.
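To make the two ingredients named above concrete, here is a minimal sketch of a Hellinger-ball membership test and a logistic preference link. It is an illustration under assumptions, not the authors' implementation: policies are reduced to discrete action distributions at a single state, the logistic (Bradley-Terry) link stands in as one common example of a GLM-style preference model, and the names `hellinger_distance`, `in_hellinger_ball`, `preference_probability`, and the radius value are all hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    squared_gap = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared_gap)


def in_hellinger_ball(candidate, center, radius):
    """True if `candidate` lies within `radius` of `center` in Hellinger distance.

    Assumption: `center` plays the role of the policy estimated from offline
    demonstrations, and `radius` is a hypothetical confidence width.
    """
    return hellinger_distance(candidate, center) <= radius


def preference_probability(return_a, return_b):
    """Probability that trajectory A is preferred over B under a logistic link.

    Assumption: preferences depend only on the difference of scalar scores.
    """
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


# Hypothetical usage: keep only candidates close to the offline estimate.
offline_estimate = [0.7, 0.2, 0.1]
candidates = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
safe_candidates = [c for c in candidates
                   if in_hellinger_ball(c, offline_estimate, radius=0.2)]
```

In this sketch, restricting exploration to `safe_candidates` mirrors the idea of constraining online exploration to a region supported by the offline data; how the radius is chosen and how it depends on the number of demonstrations is specified by the paper's analysis, not by this example.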
Abstract
Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
