Solving New Tasks by Adapting Internet Video Knowledge

Calvin Luo; Zilai Zeng; Yilun Du; Chen Sun

Solving New Tasks by Adapting Internet Video Knowledge

Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun

TL;DR

Solving New Tasks by Adapting Internet Video Knowledge addresses how to combine internet-scale video priors with limited in-domain demonstrations to achieve novel text-conditioned robotic behaviors. It compares three adaptation strategies—Direct Finetuning, Subject Customization, and Probabilistic Adaptation—and introduces Inverse Probabilistic Adaptation to leverage pretrained priors robustly. Through standardized evaluations on MetaWorld and DeepMind Control, the work demonstrates effective text-conditioned planning and policy supervision, with inverse probabilistic adaptation providing robust generalization even with suboptimal data. The findings present a data-efficient pathway to harness large-scale video priors for embodied AI, enabling broader and more flexible task generalization in real environments.

Abstract

Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models intimately understand alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also considering their independent data and resource considerations. We successfully demonstrate across robotic environments that adapting powerful video models with small scales of example data can successfully facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.

Solving New Tasks by Adapting Internet Video Knowledge

TL;DR

Abstract

Solving New Tasks by Adapting Internet Video Knowledge

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (5)