ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data
Nitish Dashora, Dibya Ghosh, Sergey Levine
TL;DR
ViVa introduces a data-driven prior for sparse-reward online RL by learning an intent-conditioned value function (ICVF) from diverse video data and environment interactions. The core idea is to encode a prior over temporal distance to the goal as a value function V(s, g), which augments the extrinsic reward during online RL and guides exploration without task-specific demonstrations. The method pre-trains the value function on large-scale Ego4D video data, fine-tunes it on environment-specific data, and then integrates it into online RL by adding a value-based penalty to the reward. Empirical results on AntMaze, RoboVerse, and Franka Kitchen show that ViVa generalizes to unseen goals, benefits from video pre-training (especially in low-data regimes), scales with the amount and diversity of data, and outperforms several offline-to-online baselines. The work highlights the practicality of using broad video data to shape goal-directed behavior in online RL, and it opens avenues for richer multi-modal, language-conditioned intentions and stronger zero-shot transfer.
Abstract
Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using various data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
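To make the last step concrete, the following minimal Python sketch shows one way a goal-conditioned value could be folded into a sparse reward. It is an illustration under stated assumptions, not the authors' implementation: `toy_value` is a hypothetical stand-in for the learned V(s, g) (here simply the negative Euclidean distance to the goal), and the additive form with a `weight` coefficient is an assumed shaping rule.

```python
import numpy as np


def toy_value(state, goal):
    """Hypothetical stand-in for a learned V(s, g).

    Returns the negative Euclidean distance to the goal, so values are
    at most 0 and approach 0 as the state nears the goal. A real system
    would query a trained network here instead.
    """
    state = np.asarray(state, dtype=float)
    goal = np.asarray(goal, dtype=float)
    return -float(np.linalg.norm(state - goal))


def shaped_reward(sparse_reward, next_state, goal, value_fn=toy_value, weight=1.0):
    """Add a weighted goal-conditioned value to the sparse extrinsic reward.

    The additive form and the `weight` parameter are assumptions made
    for this illustration.
    """
    return sparse_reward + weight * value_fn(next_state, goal)


goal = np.array([3.0, 4.0])
far = shaped_reward(0.0, np.array([0.0, 0.0]), goal)   # far from the goal
near = shaped_reward(0.0, np.array([3.0, 3.0]), goal)  # one unit from the goal
at_goal = shaped_reward(1.0, goal, goal)               # sparse reward fires at the goal
print(far, near, at_goal)
```

In this toy setting the shaped reward grows as the agent approaches the goal, even though the sparse reward is zero everywhere except at the goal itself. That dense signal is the kind of guidance the abstract describes.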
