ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data

Nitish Dashora, Dibya Ghosh, Sergey Levine

TL;DR

ViVa introduces a data-driven prior for sparse-reward online RL by learning an intent-conditioned value function (ICVF) from diverse video data and environment interactions. The core idea is to encode a temporal-distance prior for goal reaching as a value function V(s, g) that augments the extrinsic reward during online RL, guiding exploration without task-specific demonstrations. The method pre-trains the value function on large-scale Ego4D video data, fine-tunes it on environment-specific data, and then keeps it frozen during online RL, where its output is added to the reward as a guidance term. Experiments on AntMaze, RoboVerse, and Franka Kitchen show that ViVa generalizes to unseen goals, benefits from video pre-training (especially in low-data regimes), and improves as the amount and diversity of data grow, outperforming several offline-to-online baselines. The work shows that broad video data can be used in practice to shape goal-directed behavior in online RL, and it points toward richer multi-modal, language-conditioned intentions and stronger zero-shot transfer.

Abstract

Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using various data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
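
The reward augmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `shaped_reward`, `toy_value_fn`, and the `weight` parameter are hypothetical names introduced here, and the learned value function V(s, g) is replaced by a hand-written negative Euclidean distance purely so the example runs without a trained model.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    sparse_reward: float,
    state: State,
    goal: State,
    value_fn: Callable[[State, State], float],
    weight: float = 1.0,  # assumed scaling knob, not taken from the paper
) -> float:
    """Add a frozen goal-conditioned value estimate to the sparse task reward."""
    return sparse_reward + weight * value_fn(state, goal)


def toy_value_fn(state: State, goal: State) -> float:
    """Hypothetical stand-in for a learned V(s, g): negative Euclidean distance."""
    return -sum((s - g) ** 2 for s, g in zip(state, goal)) ** 0.5


goal = (3.0, 4.0)
far = shaped_reward(0.0, (0.0, 0.0), goal, toy_value_fn)      # far from the goal
near = shaped_reward(0.0, (3.0, 3.0), goal, toy_value_fn)     # one unit away
at_goal = shaped_reward(1.0, goal, goal, toy_value_fn)        # sparse reward fires
print(far, near, at_goal)
```

In this toy setting the shaped reward increases as the agent approaches the goal, even though the sparse reward is zero everywhere except at the goal itself. That dense signal is the kind of guidance the value function is meant to supply.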

Paper Structure

This paper contains 18 sections, 11 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Left: ViVa uses samples from internet-scale video to learn a value-function that encodes goal-reaching priors. Middle: ViVa finetunes on robotics-relevant data to bring the value function into the domain of the tasks we wish to solve. Right: During online RL, we freeze the value function and augment the extrinsic reward with a guidance signal that captures temporal distances. We choose to include the robotics-relevant interaction data in our online pipeline to assist exploration.
  • Figure 2: Left: A visualization of trajectories from the corrupted dataset shown in green. Middle: The learned ICVF values across all states with the goal at the red star. Right: The optimal dense reward (i.e. L2 distance) for all states with the goal at the red star.
  • Figure 3: All plots detail the mean evaluation return computed over 10 evaluation episodes. Left: Online RL for pick-and-place on COG as we scale to more and more on-task data. The rows below show example off-task successful trajectories with the WidowX robot from the drawer_prior and blocked_drawer datasets. Right: Online RL for pick-and-place on COG when including Ego4D pretraining and off-task data sources. The rows below are a failure and a success from the prior dataset.
  • Figure 4: The online evaluation return in AntMaze when training ViVa with corrupted data. As seen, learning a value-function prior for online RL provides a more generalizable reward model when offline rewarded data is absent. Learning a behavioral prior also works in this setting.
  • Figure 5: All plots detail the mean evaluation return computed over 10 evaluation episodes. Left: Online RL for the Hinge Cabinet task in FrankaKitchen. The bottom row is an image trajectory of a demonstration of opening the hinge cabinet. Right: Online RL for the Sliding Cabinet task in FrankaKitchen. The bottom row is an image trajectory of a demonstration of opening the sliding cabinet.
  • ...and 8 more figures