Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning

Maël Macuglia, Paul Friedrich, Giorgia Ramponi

TL;DR

This work tackles two core hurdles of reward-specified RL, reward mis-specification and unsafe exploratory trials, by proposing a reward-free, offline-to-online framework that first learns a safe initial policy from expert demonstrations and then refines it online using human trajectory preferences. The BRIDGE algorithm unifies offline imitation with online preference-based reinforcement learning (PbRL): it constructs a Hellinger-ball confidence set from the offline data, constrains online exploration to this safe region, and updates a GLM-style preference model with both offline and online data. The authors derive regret bounds showing that offline data reduces online sample complexity, with BRIDGE achieving regret of order $\sqrt{T}$ that shrinks as the offline dataset grows. They validate the approach on both discrete and continuous control tasks, where it outperforms standalone behavioral cloning (BC) and online PbRL baselines. The findings offer a principled, data-efficient path for training interactive agents that learn from demonstrations and minimal human feedback without an explicit reward specification, which could make RL-based systems safer and more practical to deploy in robotics, industry, and health care.

Abstract

Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
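
To make the preference-feedback signal concrete, the following is a minimal sketch of a GLM-style preference model, assuming a logistic (Bradley-Terry) link applied to the difference of trajectory feature vectors. The feature map, the function names (`preference_prob`, `fit_preference_model`), the regularized gradient-ascent fit, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_prob(w, phi_a, phi_b):
    """P(trajectory a is preferred over b) under a logistic link (assumed)."""
    return sigmoid(np.dot(w, phi_a - phi_b))

def fit_preference_model(diffs, labels, lr=0.5, steps=2000, l2=1e-3):
    """L2-regularized maximum likelihood via plain gradient ascent.

    diffs:  (m, d) feature differences phi(tau_a) - phi(tau_b)
    labels: (m,) entries are 1.0 if tau_a was preferred, else 0.0
    """
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ w)
        # Gradient of the mean log-likelihood minus the L2 penalty.
        grad = diffs.T @ (labels - p) / len(labels) - l2 * w
        w += lr * grad
    return w

# Synthetic preferences generated from a hypothetical ground-truth parameter.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
diffs = rng.normal(size=(2000, 3))
labels = (rng.random(2000) < sigmoid(diffs @ w_true)).astype(float)
w_hat = fit_preference_model(diffs, labels)
```

In this sketch, pooling offline and online comparisons would simply mean stacking their rows into `diffs` and `labels` before fitting; how BRIDGE weights the two sources is not specified here.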

Paper Structure

This paper contains 77 sections, 40 theorems, 349 equations, 9 figures, 2 tables, 2 algorithms.

Key Result

Theorem 4.1

Let $n$ be the number of offline demonstrations from an expert policy satisfying the minimum-visitation assumption, where $\gamma_{\min} > 0$ is the minimum nonzero visitation probability under the expert policy's distribution. Then, with probability at least $1-\delta$, the regret of BRIDGE is b

Figures (9)

  • Figure 1: Overview of the BRIDGE framework. Offline estimation derives estimators $\pi^\text{BC}$ and $\hat{P}$ using the dataset $\mathbb{D}_n^H$ and constructs a confidence set in trajectory distribution space $\mathcal{P}(\mathcal{T})$ as a Hellinger ball (left), which translates to the offline policy confidence set $\Pi^\text{offline}$ in policy space $\Pi$ likely to contain $\pi^*$ (middle). The confidence set $\Pi^\text{offline}$ is then used to constrain the online preference learning phase (right), where policies are sampled from within this set and presented to the expert for preference feedback.
  • Figure 2: Cumulative regret versus baselines across four environments. Our method, BRIDGE, achieves lower regret than the offline BC (Foster et al., 2024) and online PbRL (Saha et al., 2023) baselines in both discrete tasks (a & b) and continuous control tasks (c & d). Dotted lines show BC (green) and expert (red) regret. Mean and 95% CI over 20 seeds.
  • Figure 3: Policy set size refinement for discrete (StarMDP, left) and continuous (Reacher, right) environments. BRIDGE rapidly prunes the policy search space, whereas the online PbRL baseline explores more broadly. Mean and 95% CI over 20 seeds.
  • Figure 4: BRIDGE performance for different values of the radius used to filter candidate policies and create the offline confidence set. Higher radii lead to less filtering and performance that approaches the online PbRL baseline's, while a radius too small excludes (near-)optimal candidates, leading to unavoidable regret.
  • Figure 5: BRIDGE performance for different numbers of offline demonstration trajectories. As the number of offline trajectories increases, BRIDGE's regret decreases.
  • ...and 4 more figures
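
The radius-based filtering described in Figures 1 and 4 can be sketched as follows. This is a toy illustration only: it assumes each candidate policy is summarized by a discrete distribution over a small finite set of trajectories, uses the convention $H^2(p, q) = \tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$ for the squared Hellinger distance, and the names `squared_hellinger` and `filter_candidates` and all numbers are hypothetical.

```python
import numpy as np

def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def filter_candidates(candidates, reference, radius):
    """Indices of candidates inside the Hellinger ball around `reference`."""
    return [
        i for i, p in enumerate(candidates)
        if squared_hellinger(p, reference) <= radius
    ]

# Hypothetical estimated expert trajectory distribution (offline estimate).
reference = np.array([0.7, 0.2, 0.1])

# Hypothetical trajectory distributions induced by three candidate policies.
candidates = [
    np.array([0.7, 0.2, 0.1]),  # identical to the reference
    np.array([0.6, 0.3, 0.1]),  # close to the reference
    np.array([0.1, 0.1, 0.8]),  # far from the reference
]

small = filter_candidates(candidates, reference, radius=0.02)
large = filter_candidates(candidates, reference, radius=1.0)
```

With the small radius only the first two candidates survive, while the large radius keeps all three. This mirrors the trade-off in Figure 4: a larger radius filters less, and a radius that is too small risks excluding near-optimal candidates.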

Theorems & Definitions (78)

  • Theorem 4.1: Main result: Offline data reduces online regret
  • Theorem 4.2: Offline confidence set radius
  • Lemma 4.3: Concentrability coefficient bound
  • Lemma 4.4: Offline policy confidence set
  • Lemma 4.5: Online policy confidence set
  • Claim
  • proof
  • Lemma B.1: Offline policy confidence set under known dynamics
  • proof
  • Lemma B.2: Optimal policy containment
  • ...and 68 more