Grounding Video Models to Actions through Goal Conditioned Exploration

Yunhao Luo, Yilun Du

TL;DR

This work tackles grounding large, Internet-scale video models to actionable continuous control without action labels by enabling goal-directed exploration guided by synthesized video frames. It introduces hindsight relabeling, chunk-level action prediction, and periodic random bootstrapping to train a goal-conditioned policy online, using video-generated frames as goals and a replay buffer for supervision. Across Libero, MetaWorld, Calvin, and iTHOR, the approach achieves competitive or superior performance to action-supervised baselines, highlighting the potential of self-supervised grounding of visual priors for embodied tasks. The results suggest a scalable path toward leveraging generative video priors for manipulation and navigation without costly demonstrations or annotations.
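As a rough illustration of how hindsight relabeling and chunk-level action prediction might fit together, the sketch below turns one recorded rollout into (observation, goal, action chunk) training examples by treating a state the agent actually reached as the goal. The function name `hindsight_relabel`, the parameters `chunk_len` and `max_goal_gap`, and the rule for sampling the goal index are all assumptions made for illustration; the summary above does not specify these details.

```python
import random
from typing import List, Sequence, Tuple

# One training example: (observation, goal observation, chunk of actions).
Example = Tuple[object, object, list]


def hindsight_relabel(
    observations: Sequence,
    actions: Sequence,
    chunk_len: int,
    max_goal_gap: int,
    rng: random.Random,
) -> List[Example]:
    """Turn one rollout into goal-conditioned examples (illustrative only).

    observations has len(actions) + 1 entries: observations[t + 1] is the
    state reached after executing actions[t] in observations[t].
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    if max_goal_gap < chunk_len:
        raise ValueError("max_goal_gap must be at least chunk_len")

    examples: List[Example] = []
    # Only start at steps where a full chunk of actions is available.
    for t in range(len(actions) - chunk_len + 1):
        # Assumption: the goal is a state reached at or after the chunk's end,
        # at most max_goal_gap steps ahead, clipped to the end of the rollout.
        latest = min(t + max_goal_gap, len(observations) - 1)
        goal_t = rng.randint(t + chunk_len, latest)
        chunk = list(actions[t : t + chunk_len])
        examples.append((observations[t], observations[goal_t], chunk))
    return examples
```

Because every goal is a state that was really visited, each example is a valid supervised target for a goal-conditioned policy even when the rollout failed to reach the frame it was originally chasing.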

Abstract

Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and the model is limited to visual settings similar to those in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment, using generated video states as visual goals for exploration. We propose a framework that uses trajectory-level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations, without requiring any action annotations.
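The exploration loop described in the abstract can be sketched in a toy setting. The example below is a minimal sketch, not the authors' implementation: the 1-D integer environment, the hard-coded `video_goals` standing in for frames from a video model, the `greedy_policy` standing in for the learned goal-conditioned policy, and the `random_every` schedule for random exploration episodes are all hypothetical. Policy training on the collected buffer is omitted.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[int, int, int]  # (state, action, next_state)


def step(state: int, action: int) -> int:
    """Toy 1-D environment (hypothetical): the action shifts the state."""
    return state + action


def greedy_policy(state: int, goal: int) -> int:
    """Stand-in for a learned policy: move one unit toward the goal."""
    return (goal > state) - (goal < state)


def collect_episode(
    start: int,
    video_goals: List[int],
    policy: Callable[[int, int], int],
    steps_per_goal: int,
    explore_randomly: bool,
    rng: random.Random,
) -> List[Transition]:
    """Roll out one episode, chasing each video frame as a goal in turn."""
    state = start
    transitions: List[Transition] = []
    for goal in video_goals:
        for _ in range(steps_per_goal):
            if explore_randomly:
                action = rng.choice([-1, 0, 1])  # random exploration episode
            else:
                action = policy(state, goal)  # goal-directed exploration
            next_state = step(state, action)
            transitions.append((state, action, next_state))
            state = next_state
    return transitions


def explore(num_episodes: int, random_every: int, rng: random.Random) -> List[Transition]:
    """Fill a replay buffer, using random actions every `random_every` episodes."""
    replay_buffer: List[Transition] = []
    for episode in range(num_episodes):
        video_goals = [2, 4, 6]  # stand-in for frames from a video model
        use_random = episode % random_every == 0
        replay_buffer.extend(
            collect_episode(0, video_goals, greedy_policy, 3, use_random, rng)
        )
    return replay_buffer
```

In a full system, the transitions in `replay_buffer` would be used to update the policy between episodes, so that later goal-directed rollouts follow the video frames more closely.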

Paper Structure

This paper contains 53 sections, 3 equations, 22 figures, 16 tables, 1 algorithm.

Figures (22)

  • Figure 1: Grounding Video Models to Actions. Our approach learns to ground a large pretrained video model into continuous actions through goal-directed exploration in the environment. Given a synthesized video, a goal-conditioned policy attempts to reach each visual goal in the video, with data in the resulting real-world execution saved in a replay buffer to train the goal-conditioned policy.
  • Figure 2: Environment Demonstrations. We evaluate our method on three robotic manipulation environments (Libero, Meta-World, and Calvin) and one visual navigation environment (iThor), for a total of 30 tasks. Images in (a) and (c) denote the goal object states of a subset of tasks. Images in (b) are randomly sampled start observations of a subset of tasks. In (d), we show the layout for each scene from the agent's view.
  • Figure 3: Qualitative Results of the task 'put the yellow and white mug to the right plate' in the Libero environment. The start states of the robot and objects are randomized at test time. Only a subset of the predicted video frames is shown due to space limits. Our goal-conditioned policy, shown in the bottom right, is able to follow the video prediction and finish the task. BC cannot accurately locate the target, while AVDC can move to the mug but lacks the skill of grasping concave objects.
  • Figure 3: Quantitative Results of 4 Calvin tasks. Each task requires the agent to manipulate objects located in different regions, especially in open drawer, where the policy needs to cover the bottom right boundary of the environment.
  • Figure 4: Qualitative Results of the task Door-Open in the Meta-World environment. The positions of the box and robot are randomized at test time. Only a subset of the predicted video frames is shown due to space limits. Our goal-conditioned policy can follow the subgoals given by the video frames and successfully finish the task. BC misses the handle, probably because the box position is outside the training distribution, and starts to predict random actions. AVDC can move to the handle thanks to the exact handle location it is given. However, it begins to close the door halfway, probably because of incorrect flow prediction caused by error accumulation or occlusion.
  • ...and 17 more figures