VideoAgent: Self-Improving Video Generation

Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang

TL;DR

VideoAgent tackles hallucinations and unrealistic physics in text-to-video planning for embodied tasks by grounding video generation in external feedback and ongoing data collection. It introduces self-conditioning consistency to iteratively refine video plans, and uses a vision-language model to guide inference and stopping criteria, with online finetuning to improve future generations. Across Meta-World, iTHOR, and BridgeData-V2, VideoAgent significantly reduces hallucinations and boosts downstream task success compared to baselines, with online refinements delivering the largest gains. The approach demonstrates a practical pathway to grounding video-based policies in real-world dynamics, enabling more reliable video-to-action control for robotics and broader visual-policy learning applications.

Abstract

Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, allowing inference-time compute to be turned into better generated video plans. As the refined video plan is being executed, VideoAgent can collect additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robots can be an effective tool in grounding video generation in the physical world. Video demos and code can be found at https://video-as-agent.github.io.

Paper Structure

This paper contains 49 sections, 15 equations, 11 figures, 9 tables, 3 algorithms.

Figures (11)

  • Figure 1: The VideoAgent Framework. VideoAgent first generates a video plan conditioned on an image observation and a task description, similar to Du et al. (2023). It then (1) iteratively refines the video plan using feedback from a vision-language model (VLM), (2) uses the VLM to select the best refined video plan, which is converted to control actions through optical flow, and (3) executes the control actions in an environment and improves video generation using real-world feedback and additional data collected online.
  • Figure 2: An illustration of Self-Conditioning Consistency. The horizontal direction represents the regular denoising process. The two rows represent two refinement iterations. $\hat{\mathbf{x}}_i$ denotes the generated video plan at refinement iteration $i$. We condition the refinement iteration $i+1$ on the generated video from the previous iteration $\hat{\mathbf{x}}_{i}$.
  • Figure 3: Effect of Refinement Iterations. The accuracy of downstream tasks generally increases as the number of refinement iterations increases.
  • Figure 4: Effect of Online Iterations. The overall task success of VideoAgent increases as the number of online iterations increases.
  • Figure 5: Correcting Hallucinations in Video Generation: The AVDC model hallucinates after the second frame, removing the colander and placing the banana on the table. In contrast, VideoAgent accurately retains the colander's position and correctly places the banana inside.
  • ...and 6 more figures
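
The refinement loop described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. This is not the authors' implementation: `refine_video_plan`, `toy_generate`, `toy_vlm_score`, the `accept_threshold` parameter, and the use of a list of floats as a stand-in for a video are all hypothetical names and simplifications introduced here. The sketch only shows the control flow: each iteration conditions on the previous iteration's output, a scorer playing the role of the VLM decides when to stop, and the best-scoring plan is kept.

```python
from typing import Callable, List, Optional, Tuple

# Stand-in type for a video plan (assumption: a real system would use a
# tensor of frames; a list of floats keeps the sketch self-contained).
Video = List[float]


def refine_video_plan(
    generate: Callable[[str, str, Optional[Video]], Video],
    vlm_score: Callable[[Video, str], float],
    first_frame: str,
    task: str,
    max_iters: int = 5,
    accept_threshold: float = 0.9,
) -> Tuple[Video, float, int]:
    """Iteratively refine a video plan, conditioning on the previous sample.

    `generate` and `vlm_score` are hypothetical callables standing in for
    the video model and the VLM feedback described in the text.
    Returns (best plan, its score, number of iterations run).
    """
    previous: Optional[Video] = None
    best_plan: Video = []
    best_score = float("-inf")
    iterations_run = 0

    for i in range(max_iters):
        # Iteration i+1 is conditioned on the plan from iteration i.
        plan = generate(first_frame, task, previous)
        score = vlm_score(plan, task)
        iterations_run = i + 1

        # Keep the best plan seen so far (selection step).
        if score > best_score:
            best_plan, best_score = plan, score

        # Stop early once the scorer accepts the plan.
        if score >= accept_threshold:
            break
        previous = plan

    return best_plan, best_score, iterations_run


def toy_generate(first_frame: str, task: str, previous: Optional[Video]) -> Video:
    """Toy generator: quality starts at 0.25 and improves by 0.25 per refinement."""
    if previous is None:
        return [0.25]
    return [min(1.0, previous[0] + 0.25)]


def toy_vlm_score(plan: Video, task: str) -> float:
    """Toy scorer: reads the quality value stored in the toy plan."""
    return plan[0]


if __name__ == "__main__":
    plan, score, iters = refine_video_plan(
        toy_generate, toy_vlm_score, "obs.png", "put banana in colander",
        max_iters=5, accept_threshold=0.75,
    )
    print(plan, score, iters)
```

With the toy stand-ins, the loop stops on the third iteration, once the score reaches the threshold. In a real system, the accepted plan would then be converted to control actions and executed, and the collected rollouts could be used for online finetuning, which this sketch leaves out.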