Predictive Preference Learning from Human Interventions

Haoyuan Cai, Zhenghao Peng, Bolei Zhou

TL;DR

PPL addresses safety and sample efficiency in interactive imitation learning by combining trajectory prediction with online preference learning. A trajectory predictor forecasts the agent's next $H$ steps, and each human intervention at the current state is bootstrapped over a preference horizon of $L$ future states to form contrastive preference data, so that corrections propagate into risky regions before failures occur. The policy is trained with a combination of behavioral cloning on expert data and a contrastive preference loss over the predicted states, which improves learning efficiency and reduces expert workload. Theoretical bounds link the performance gap to state-distribution shift and preference-label quality. Experiments on MetaDrive and Robosuite cover both driving control and manipulation tasks, with real human experts and neural proxy experts supporting practical applicability.

Abstract

Learning from human involvement incorporates a human subject who monitors and corrects the agent's behavioral errors. Although most interactive imitation learning methods focus on correcting the agent's action at the current state, they do not adjust its actions in future states, which may be more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, under the assumption that the agent follows the same action and the human makes the same intervention throughout the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing the number of human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl
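The bootstrapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PreferencePair`, `predict_rollout`, `bootstrap_intervention`, and `toy_step` are hypothetical, and the one-dimensional additive dynamics merely stand in for whatever trajectory predictor is actually used. The sketch assumes, as the abstract states, that the agent repeats the same action and the human would make the same intervention at every step of the preference horizon L.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass(frozen=True)
class PreferencePair:
    """One contrastive tuple: at `state`, `preferred` is favored over `rejected`."""
    state: State
    preferred: Action  # the human's intervening action
    rejected: Action   # the agent's proposed action


def predict_rollout(state: State, action: Action,
                    step_fn: Callable[[State, Action], State],
                    horizon: int) -> List[State]:
    """Predict `horizon` future states while holding the action fixed."""
    states = []
    current = state
    for _ in range(horizon):
        current = step_fn(current, action)
        states.append(current)
    return states


def bootstrap_intervention(state: State, agent_action: Action,
                           human_action: Action,
                           step_fn: Callable[[State, Action], State],
                           pref_horizon: int) -> List[PreferencePair]:
    """Turn one intervention into `pref_horizon` preference pairs.

    Assumption: the same (human > agent) preference is copied to every
    predicted state along the agent's forecasted rollout.
    """
    predicted = predict_rollout(state, agent_action, step_fn, pref_horizon)
    return [PreferencePair(s, human_action, agent_action) for s in predicted]


def toy_step(state: State, action: Action) -> State:
    """Hypothetical toy dynamics: the next state is state + action."""
    return tuple(s + a for s, a in zip(state, action))


if __name__ == "__main__":
    pairs = bootstrap_intervention((0.0,), (1.0,), (-1.0,), toy_step, pref_horizon=3)
    for pair in pairs:
        print(pair)
```

A larger `pref_horizon` covers more of the states the agent is about to visit, but the copied label is less likely to match what the human would really do that far ahead, which is the trade-off the theoretical analysis refers to.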

Paper Structure

This paper contains 28 sections, 7 theorems, 44 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.1

We denote the Q-function of the human policy $\pi_h$ as $Q^*(s, a)$. We assume that for any $(s, a, a')$, $\left| Q^*(s, a) - Q^*(s, a') \right| \leq U$, $|\log \pi_h (a | s) - \log \pi_h(a' | s) | \leq M$, and $|\log \pi_n (a | s) - \log \pi_n(a'|s) | \leq M$, where $U, M > 0$ are constants. When

Figures (8)

  • Figure 1: Our Predictive Preference Learning from Human Interventions. (Top) Our approach forecasts the agent’s upcoming trajectory (the red dotted path) and visualizes it for the human expert, who will intervene if the forecasted path indicates an upcoming failure. (Bottom) A single intervention is then interpreted as hypothesized preference signals across the predicted states. These signals reflect the agent’s imputed imagination of what the expert would prefer, guiding the policy to avoid the risky maneuver in similar future contexts. This integration of proactive forecasting and preference learning accelerates policy improvement and reduces the total number of expert interventions required.
  • Figure 2: Illustration of Predictive Preference Learning. (Left) At each decision point, the agent proposes an action, and its future trajectory is predicted and visualized. The human expert reviews this rollout and intervenes only when a potential failure is anticipated. The intervention is recorded alongside the state into the human buffer $\mathcal{D}_h$ for behavioral cloning. (Right) Each recorded intervention is then converted into contrastive preference pairs over the predicted future states $\tilde{s}_1, \cdots, \tilde{s}_L$. These preference tuples are stored in a preference buffer $\mathcal{D}_\text{pref}$ and used to train the policy via a contrastive classification loss, propagating expert intents into regions the agent is likely to explore.
  • Figure 3: The test-time performance curves of PPL and the IIL counterpart PVP [peng2024learning] in three different environments. The x-axis is the number of environment interactions, and the y-axis is the agent's success rate in a held-out test environment, where evaluation is conducted without expert involvement. Compared to the IIL counterpart, our approach achieves much higher learning efficiency and reduces the expert effort required.
  • Figure 4: Human interfaces of the three tasks: MetaDrive (a), Table Wiping (b), and Nut Assembly (c). In (a), the agent's forecasted trajectory (the red dots) leads to a collision, prompting the expert to intervene via the gamepad (blue dots show the predicted rollout of the expert). In (b) and (c), the expert observes the agent's forecasted trajectory and intervenes via the keyboard if necessary.
  • Figure 5: Training process of PPL in the MetaDrive environment with the human expert over 20K steps. We plot the test success rate (left), training takeover rate (top right), and training episodic safety cost (bottom right). During training, when the agent’s forecasted trajectory (red dots) leads to a collision, the human expert intervenes via the gamepad, and the corrected rollout is shown (blue dots). When the agent’s forecasted trajectory is safe, it is visualized in green dots. The agent becomes autonomous and performant during training, requiring fewer human interventions to maintain safety.
  • ...and 3 more figures
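The training signal described in the Figure 2 caption (behavioral cloning on the human buffer plus a contrastive classification loss on the preference buffer) can be sketched as follows. The exact loss used in the paper is not given in this summary, so the logistic form below is an assumption, and `beta` and `pref_weight` are hypothetical hyperparameters. Inputs are log-probabilities that a policy assigns to actions; how they are computed is left out.

```python
import numpy as np


def bc_loss(logp_human: np.ndarray) -> float:
    """Behavioral cloning: negative mean log-likelihood of human actions."""
    return float(-np.mean(logp_human))


def preference_loss(logp_preferred: np.ndarray, logp_rejected: np.ndarray,
                    beta: float = 1.0) -> float:
    """Assumed contrastive form: -log sigmoid(beta * (logp_pref - logp_rej)).

    np.logaddexp(0, -m) equals log(1 + exp(-m)), a numerically stable
    way to evaluate -log sigmoid(m).
    """
    margin = beta * (logp_preferred - logp_rejected)
    return float(np.mean(np.logaddexp(0.0, -margin)))


def combined_objective(logp_human: np.ndarray,
                       logp_preferred: np.ndarray,
                       logp_rejected: np.ndarray,
                       beta: float = 1.0,
                       pref_weight: float = 1.0) -> float:
    """Sum of the BC term and the weighted preference term (weighting assumed)."""
    return bc_loss(logp_human) + pref_weight * preference_loss(
        logp_preferred, logp_rejected, beta)


if __name__ == "__main__":
    logp_h = np.array([-0.5, -1.5])    # log pi(a_h | s) on the human buffer
    logp_pos = np.array([-0.2, -0.4])  # log pi(a_h | predicted state)
    logp_neg = np.array([-2.0, -1.0])  # log pi(a_n | predicted state)
    print(combined_objective(logp_h, logp_pos, logp_neg))
```

Under this assumed form, the preference term shrinks as the policy assigns more probability to the human action than to the agent's original action at each predicted state, which is one way to push the policy away from the risky maneuver before it is executed.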

Theorems & Definitions (16)

  • Theorem 4.1
  • Theorem F.1: Formal Statement of Theorem 4.1
  • proof
  • Lemma F.2: Performance Optimality Gap on the State Distribution Shift
  • proof : Proof Sketch
  • proof
  • Lemma F.3: Misalignment of Preference Pairs
  • proof : Proof Sketch
  • proof
  • Lemma F.4: Optimization Error Bounds the Total Variation
  • ...and 6 more