Solving New Tasks by Adapting Internet Video Knowledge

Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun

TL;DR

This work addresses how to combine internet-scale video priors with limited in-domain demonstrations to achieve novel text-conditioned robotic behaviors. It compares three adaptation strategies (Direct Finetuning, Subject Customization, and Probabilistic Adaptation) and introduces a fourth, Inverse Probabilistic Adaptation, to leverage pretrained priors robustly. Through standardized evaluations on MetaWorld and DeepMind Control, the work demonstrates effective text-conditioned planning and policy supervision, with Inverse Probabilistic Adaptation generalizing robustly even when the adaptation data is suboptimal. The findings suggest a data-efficient pathway to harness large-scale video priors for embodied AI, enabling broader and more flexible task generalization.

Abstract

Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models intimately understand alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also weighing their respective data and resource requirements. We demonstrate across robotic environments that adapting powerful video models with small amounts of example data can facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of adaptation data, solving novel tasks even when only suboptimal in-domain demonstrations are available.

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 18 tables.

Figures (5)

  • Figure 1: Adaptation Techniques. We explore how in-domain information can be integrated into large-scale text-to-video models through three different adaptation techniques: Subject Customization, Probabilistic Adaptation, and Direct Finetuning. Subject Customization only modifies the image and text encoder, rather than the motion module, and is lightweight in terms of data requirements: it only utilizes pairs of static images and text annotated with a special identifier. Probabilistic Adaptation learns a small in-domain model from paired video data, which is then used through score composition with a large-scale video model that is kept frozen. The small in-domain model can be flexibly parameterized to consider available training resources. Direct Finetuning seeks to update the motion module of the large-scale pretrained video model with in-domain paired video data.
  • Figure 2: Downstream Task Evaluation. We identify downstream robotic task performance as a way to achieve standardized, quantitative comparisons across adaptation techniques. We evaluate how adapted video models can enable text-conditioned generalization via two approaches: visual planning and policy supervision. For visual planning, the adapted video model synthesizes a text-conditioned video plan into the future, which is then converted into actions to follow through a separately trained inverse dynamics model. In policy supervision, the adapted video model is used in a discriminative manner to evaluate frames achieved by the policy; these are converted into text-conditioned rewards, which the policy is optimized to maximize.
  • Figure 3: Novel Text-Conditioned Generalization. In the top row, we visualize a free-form video generation from a directly finetuned AnimateDiff model for the novel text prompt "a dog jumping". This was a behavior unseen during adaptation. When using this adapted video model for policy supervision, we showcase that it can successfully supervise a downstream Dog agent to behave according to novel text specifications in a zero-shot manner (policy rollout shown in bottom row).
  • Figure A1: Continued Denoising. We visualize frames from a task unseen during adaptation, corrupted with a level of Gaussian noise (top row). We then show the result of continued denoising using an inverse probabilistic adaptation model to verify it can visually generalize to fill in novel in-domain information. Despite not having seen a button, it is able to reconstruct it conditioned on text. This figure is for intuition; in practice, a much higher noise level is used, shown in Figure A2.
  • Figure A2: Continued Denoising (in practice). In practice, an aggressive level of Gaussian corruption is usually used on achieved frames for reward computation (700 for MetaWorld). However, because to the human eye this may look virtually indistinguishable from pure noise, we supply an illustrative example in Figure A1 using a noise level of 400. Here, we showcase visuals of the same unseen task corrupted with a practical noise level of 700. We then show the result of continued denoising to visually verify the model integrates adapted in-domain information successfully. When continued denoising starts from such heavy corruption, conditioned on the text prompt "a robot arm pushing a button", the adapted text-to-video model reconstructs novel in-domain features such as the button in surprising detail, even though it has not seen a button during adaptation. The resulting continued denoising video can also be evaluated against in-domain examples via FVD for further insights.
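
The captions for Figures A1 and A2 contrast Gaussian corruption at noise levels 400 and 700. The sketch below may help build intuition for why level 700 looks close to pure noise. It applies the standard diffusion forward process, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, to a stack of frames. The linear beta schedule, the 1000-step horizon, the frame shape, and the helper names `make_alpha_bar` and `corrupt` are all assumptions made for illustration; the excerpt does not state which noise schedule the paper's models use.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def corrupt(frames, t, alpha_bar, rng):
    """Apply the diffusion forward process to clean frames at noise level t."""
    noise = rng.standard_normal(frames.shape)
    signal_scale = np.sqrt(alpha_bar[t])        # weight on the clean frames
    noise_scale = np.sqrt(1.0 - alpha_bar[t])   # weight on the Gaussian noise
    return signal_scale * frames + noise_scale * noise


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical stand-in for achieved frames: 8 frames of 64x64 RGB in [-1, 1].
frames = rng.uniform(-1.0, 1.0, size=(8, 64, 64, 3))

for t in (400, 700):
    noisy = corrupt(frames, t, alpha_bar, rng)
    # Correlation with the clean frames shows how much signal survives.
    corr = np.corrcoef(frames.ravel(), noisy.ravel())[0, 1]
    print(f"t={t}: signal weight={np.sqrt(alpha_bar[t]):.3f}, "
          f"correlation with clean frames={corr:.3f}")
```

Under this assumed schedule, the weight on the clean frames is much smaller at level 700 than at level 400, which is consistent with the captions' remark that the practical noise level is hard to tell apart from pure noise by eye. The continued denoising step itself needs the adapted video model and is not reproduced here.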