Robotic Offline RL from Internet Videos via Value-Function Pre-Training

Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar

TL;DR

This work targets robust generalization in robotic offline RL by pre-training value representations on unlabelled internet video, then aligning them with robot data through offline RL and task-specific fine-tuning. The method, V-PTR, learns an intent-conditioned value function via TD-learning on Ego4D video, refines the resulting representation on multi-task robot data, and then adapts it to a target task. Experiments on a real WidowX robot show improved zero-shot generalization, robustness to distractors, and the ability to handle novel objects, outperforming prior video-based and RL-based baselines. The study offers a practical way to use abundant human video for robotic learning and outlines directions for scaling and for language-based task specification.

Abstract

Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/
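
To make the "type mismatch" concrete, the following is a minimal tabular sketch of temporal-difference value learning on observation-only sequences. It is not the V-PTR implementation: the function name `td0_goal_values`, the integer state encoding, the hindsight choice of the final frame as the intent, and the reward (0 on reaching the goal, -1 otherwise) are all illustrative assumptions, chosen only to show that a TD update needs consecutive observations but no recorded actions.

```python
import numpy as np


def td0_goal_values(trajectories, n_states, gamma=0.9, lr=0.5, epochs=200):
    """Tabular TD(0) on observation-only trajectories (hypothetical sketch).

    V[s, g] estimates the discounted return for reaching goal g from state s
    under an ASSUMED reward: 0 on arriving at g, -1 on every other step.
    No actions are used anywhere in the update.
    """
    V = np.zeros((n_states, n_states))
    for _ in range(epochs):
        for traj in trajectories:
            goal = traj[-1]  # assumption: the last frame stands in for the intent
            for s, s_next in zip(traj[:-1], traj[1:]):
                done = s_next == goal
                reward = 0.0 if done else -1.0
                # Bootstrap from the next observation unless the goal was reached.
                bootstrap = 0.0 if done else gamma * V[s_next, goal]
                V[s, goal] += lr * (reward + bootstrap - V[s, goal])
    return V


# A single four-frame "video" whose frames are encoded as states 0..3.
V = td0_goal_values([[0, 1, 2, 3]], n_states=4)
print(V[:3, 3])  # values for reaching state 3 from states 0, 1, 2
```

In this toy setting the learned values increase monotonically as the sequence approaches its final frame, which is the kind of temporal structure a value function can extract from video without action labels.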

Paper Structure (29 sections, 6 equations, 9 figures, 5 tables)

This paper contains 29 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Network architecture. V-PTR first pre-trains image representations by training a general value function from video and then refines this representation via multi-task pre-training on robot data.
  • Figure 2: Examples of setup and successful rollouts for complex tasks. We utilize the robot setup from the Bridge dataset (Ebert et al., 2021) for our tasks. Top: Two-phase open microwave; Bottom: Sweep beans into pile with tool.
  • Figure 3: Visualizing qualitative performance of V-PTR and VIP. Here we show rollouts for V-PTR (top) and VIP (bottom) on the real robot manipulation tasks. V-PTR carefully executes the task by orienting the gripper to match the object and retrying on failure, whereas VIP grasps objects without this re-orientation, leading to failure.
  • Figure 4: Examples of areas swept by VIP (Ma et al., 2022) (top) and V-PTR (bottom). V-PTR sweeps a much larger area (blue) and consistently begins a second sweep, whereas VIP is too slow to sweep a second time.
  • Figure 5: Visualizing the learned values $V(\mathbf{s}_t)$ w.r.t. time-step $t$ on rollouts from training data (top), held-out data (middle), and rollouts with distractor objects (bottom) obtained after multi-task pre-training in phase 2 ($V(\mathbf{s}_t)$ is computed using the average of the multi-task Q-value under actions sampled from the learned policy). Note that values trained by PTR and VIP tend to be highly non-smooth, especially on held-out rollouts with novel distractors, whereas V-PTR produces smooth value functions.
  • ...and 4 more figures
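
The Figure 5 caption states that $V(\mathbf{s}_t)$ is computed as the average of the multi-task Q-value under actions sampled from the learned policy. A minimal sketch of that Monte Carlo average follows; the names `value_from_q`, `q_fn`, and `policy_sample_fn`, the sample count, and the toy Q-function and policy are hypothetical stand-ins, not the paper's networks.

```python
import numpy as np


def value_from_q(q_fn, policy_sample_fn, state, n_samples=16, rng=None):
    """Estimate V(s) as the mean of Q(s, a) over actions a sampled from the policy.

    q_fn(state, action) -> float and policy_sample_fn(state, rng) -> action
    are placeholders for a learned Q-function and policy (assumed interfaces).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    actions = [policy_sample_fn(state, rng) for _ in range(n_samples)]
    return float(np.mean([q_fn(state, a) for a in actions]))


# Toy stand-ins: a Q-function that adds state and action, and a policy
# that always returns the same action, so the average is exact.
toy_q = lambda s, a: s + a
toy_policy = lambda s, rng: 2.0
print(value_from_q(toy_q, toy_policy, state=1.0))  # prints 3.0
```

With a stochastic policy the estimate varies with the number of samples, so curves like those in Figure 5 would depend on how many actions are drawn per state; the source does not state that number.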