Human Preference Modeling Using Visual Motion Prediction Improves Robot Skill Learning from Egocentric Human Video

Mrinal Verghese, Christopher G. Atkeson

TL;DR

This work tackles data-efficient robot skill learning by leveraging abundant egocentric human videos. It replaces long-horizon value estimation with a dense, per-step reward derived from predicted motion of task-relevant object points, computed as the alignment between predicted and observed point deltas. A motion-prediction transformer is trained on human videos to capture short-horizon human preferences, and a residual SAC framework fine-tunes a base behavior-cloned policy on real hardware using this reward. Across simulation and three real-world tasks, Motion Prediction Reward (MPR) consistently outperforms temporal-distance rewards and shows strong sample efficiency, enabling notable improvements with only 10 demonstrations and about an hour of training.
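To make the residual setup concrete, the following is a minimal sketch of one common way a learned residual is combined with a base policy's action. The summary above does not specify how the two are combined, so the additive form, the `residual_scale` parameter, and the action bounds are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def compose_residual_action(
    base_action: Sequence[float],
    residual_action: Sequence[float],
    residual_scale: float = 0.1,  # hypothetical scale; not a value from the paper
    low: float = -1.0,            # assumed lower action bound
    high: float = 1.0,            # assumed upper action bound
) -> List[float]:
    """Add a scaled residual to the base policy's action, then clip to bounds."""
    if len(base_action) != len(residual_action):
        raise ValueError("base and residual actions must have the same length")
    return [
        min(high, max(low, b + residual_scale * r))
        for b, r in zip(base_action, residual_action)
    ]


if __name__ == "__main__":
    # The base policy proposes an action; the residual nudges it slightly.
    print(compose_residual_action([0.5, -0.2], [1.0, -1.0]))
```

Keeping the residual small relative to the base action is a typical design choice in residual RL, since it lets early exploration stay close to the behavior-cloned policy.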

Abstract

We present an approach to robot learning from egocentric human videos by modeling human preferences in a reward function and optimizing robot behavior to maximize this reward. Prior work on reward learning from human videos attempts to measure the long-term value of a visual state as the temporal distance between it and the terminal state in a demonstration video. These approaches make assumptions that limit performance when learning from video. They must also transfer the learned value function across the embodiment and environment gap. Our method models human preferences by learning to predict the motion of tracked points between subsequent images and defines a reward function as the agreement between predicted and observed object motion in a robot's behavior at each step. We then use a modified Soft Actor Critic (SAC) algorithm initialized with 10 on-robot demonstrations to estimate a value function from this reward and optimize a policy that maximizes this value function, all on the robot. Our approach is capable of learning on a real robot, and we show that policies learned with our reward model match or outperform prior work across multiple tasks in both simulation and on the real robot.
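The per-step reward described above can be sketched as follows. The abstract states only that the reward is the agreement between predicted and observed object motion; this sketch assumes, purely for illustration, that agreement is measured as the mean cosine similarity between predicted and observed 2D point deltas. The function name, the `eps` threshold, and the handling of stationary points are hypothetical and may differ from the paper's definition.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def motion_alignment_reward(
    predicted: Sequence[Vec2],
    observed: Sequence[Vec2],
    eps: float = 1e-8,  # assumed threshold below which a delta counts as "no motion"
) -> float:
    """Mean cosine similarity between predicted and observed per-point deltas.

    Points whose predicted or observed delta is (near) zero contribute 0,
    so a transition with no observed object motion earns no reward.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        return 0.0
    total = 0.0
    for (px, py), (ox, oy) in zip(predicted, observed):
        p_norm = math.hypot(px, py)
        o_norm = math.hypot(ox, oy)
        if p_norm < eps or o_norm < eps:
            continue  # stationary point: adds nothing to the sum
        total += (px * ox + py * oy) / (p_norm * o_norm)
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(1.0, 0.0), (0.0, 2.0)]
    obs = [(3.0, 0.0), (0.0, 0.5)]
    # Directions match for both points, so alignment is maximal.
    print(motion_alignment_reward(pred, obs))
```

Under this assumed form, the reward depends only on the motion observed in a single transition, which is what makes it dense and short-horizon rather than an estimate of long-term value.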

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Modeling human preferences using Motion Prediction Reward improves robot skill learning from human video data. To learn a robust reward signal from human video for a given task $\mathcal{T}$ (in this case, "Open Microwave"), we first extract point tracks and object masks using off-the-shelf models from a set of egocentric human videos demonstrating $\mathcal{T}$. This data is used to train our Motion Prediction Transformer ($F_\theta$) that can predict how points on an object will move given a visual observation. This model is used to calculate reward for an episode of robot behavior, by tracking points in a video of the episode, and measuring the alignment between predicted and observed point motion at each transition. To learn a robot policy for task $\mathcal{T}$, we collect a very small set of demonstrations (small offline dataset) and train a behavior cloning policy ($\pi_{base}$) to kickstart learning. We then use a sample-efficient residual RL framework that leverages both the collected demonstrations and new episodes from online interaction (online buffer), labeled with our reward model, to improve its performance on the given task. This process is able to increase a robot's success rate on a task by over 30% with just an hour of real-world interaction, and significantly outperforms prior work on reward learning from human video.
  • Figure 2: Success rates for various reward models across training timesteps in simulated tasks from the Franka Kitchen benchmark. We evaluate different reward signals in our residual RL framework across two different simulated tasks. The reward signals include a sparse reward signal of 1 for task success and 0 otherwise, a handcrafted dense reward signal that measures progress to the goal using privileged simulation information, Value Implicit Pretraining (VIP) (Ma et al., 2023), which is representative of the temporal distance class of reward learning methods, and our work, Motion Prediction Reward (MPR). The success rates shown are calculated across 20 evaluations per checkpoint and 8 different seeds for each method. Standard deviations across the 8 seeds are shaded, and the dashed line shows the base policy performance. While both MPR and VIP match the handcrafted sparse reward in the cabinet task, MPR outperforms VIP in the microwave task and closely tracks the handcrafted reward performance.
  • Figure 3: Comparison of reward signals across 50 successful demonstrations for the simulated "Open Microwave" task. This plot shows the average estimated reward signal for our method (MPR), VIP (Ma et al., 2023), and a handcrafted dense reward signal that uses privileged simulation information. Our reward signal closely tracks the handcrafted reward and shows very little bias or false positive results.
  • Figure 4: Real-world training performance for the "Open Microwave" task across three runs each for VIP (Ma et al., 2023) and our approach, Motion Prediction Reward (MPR). All policies were initialized with the same base policy capable of completing the task 45% of the time (9/20 attempts) and trained for 100 episodes (about an hour of wall clock time). Our residual RL framework leveraged the demo data used to train the base policy (10 demonstrations) and data collected online, both labeled with reward signals generated by each method. The plot shows a running average success rate across the last 20 episodes, with VIP in shades of orange and MPR in shades of blue. All MPR runs improve over the base policy and finish with an average success rate of 76.7% across 20 evaluations on the final checkpoint. In contrast, the VIP runs face significant issues with stability and demonstrate "unlearning" behavior, finishing with an average success rate of 23.3%, well below the base policy's performance.
  • Figure 5: Estimated reward signals and value estimates after training from VIP and MPR show VIP has trouble identifying failed episodes. Estimated rewards (top) and values computed by learned value functions (bottom) after 100 episodes of training by each reward model for a successful (left) and failed (right) episode. Note value estimates are computed using the full robot state (world image, wrist image, end-effector pose, and base action) and the policy's action. Both approaches are capable of identifying a successful episode as shown in their reward and value estimates. However, VIP falsely assigns a moderate value to later actions and states in the failed episode, after the robot has missed grabbing the microwave handle, while MPR assigns no reward and a low value to the failed states.
  • ...and 6 more figures
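The running average success rate plotted in Figure 4 can be computed with a trailing window over episode outcomes. The sketch below is illustrative only: the function name is hypothetical, and averaging over fewer episodes before the window fills is an assumption, since the caption does not say how the first 19 episodes are handled.

```python
from typing import List, Sequence


def running_success_rate(outcomes: Sequence[bool], window: int = 20) -> List[float]:
    """Success rate over the most recent `window` episodes, one value per episode.

    Before `window` episodes have been seen, the average covers only the
    episodes observed so far (an assumption, not stated in the caption).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    rates = []
    for i in range(len(outcomes)):
        start = max(0, i - window + 1)
        recent = outcomes[start : i + 1]
        rates.append(sum(recent) / len(recent))
    return rates


if __name__ == "__main__":
    # 9 successes in 20 attempts reproduces the base policy's 45% rate.
    episodes = [True] * 9 + [False] * 11
    print(running_success_rate(episodes)[-1])
```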