RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada

TL;DR

RoboEnvision tackles long-horizon robotic video generation conditioned on a high-level instruction by introducing a two-stage, non-autoregressive diffusion framework. It uses a Vision-Language Model to decompose the goal into atomic tasks, a Keyframe Diffusion Model with cross-attention and a Semantics Preserving Attention module to generate keyframes aligned with those tasks, and a Filling Diffusion Model to interpolate between the keyframes. A lightweight Transformer-based policy then infers the robot joint states from the generated frames. The approach achieves state-of-the-art video quality and consistency on the LHMM and LanguageTable benchmarks and substantially improves long-horizon task success compared to baselines. The pipeline supports data augmentation and planning for long-horizon robotic manipulation, with potential extensions to depth and semantic conditioning for improved physical alignment.

Abstract

We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, which leads to error accumulation in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) We first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each pair of consecutive keyframes to produce the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
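The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every function below is a hypothetical stand-in (a comma split instead of a VLM, noise instead of a keyframe diffusion model, a linear blend instead of a filling diffusion model), and conditioning on an initial frame is an assumption. The only point is the structure: all keyframes are produced jointly, and each gap between consecutive keyframes is filled independently, so no generated clip is fed back as input to the next one.

```python
import numpy as np

def decompose(goal):
    """Hypothetical stand-in for the VLM: split a goal into atomic instructions."""
    return [part.strip() for part in goal.split(",") if part.strip()]

def generate_keyframes(first_frame, instructions, rng):
    """Hypothetical stand-in for the keyframe diffusion model.

    Returns one keyframe per instruction, all produced in a single call.
    """
    shape = (len(instructions),) + first_frame.shape
    return first_frame[None] + rng.normal(scale=0.1, size=shape)

def fill_between(start, end, n_fill):
    """Hypothetical stand-in for the filling diffusion model (linear blend)."""
    weights = np.linspace(0.0, 1.0, n_fill + 2)[1:-1]  # interior points only
    return [(1.0 - w) * start + w * end for w in weights]

def generate_video(first_frame, goal, n_fill=3, seed=0):
    rng = np.random.default_rng(seed)
    instructions = decompose(goal)                                  # goal -> atomic tasks
    keyframes = generate_keyframes(first_frame, instructions, rng)  # stage 1
    anchors = [first_frame, *keyframes]
    video = [anchors[0]]
    for start, end in zip(anchors[:-1], anchors[1:]):               # stage 2, per gap
        video.extend(fill_between(start, end, n_fill))
        video.append(end)
    return instructions, np.stack(video)

first_frame = np.zeros((4, 4, 3))
instructions, video = generate_video(
    first_frame, "pick up the cube, place it in the bowl, close the drawer"
)
```

With three atomic instructions and three filled frames per gap, the sketch yields one initial frame plus three segments of four frames each.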

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Top: Previous works (UniPi, AVDC) predict short-horizon videos and estimate robot actions from them. Long-horizon tasks are executed by cascading this approach sequentially along the time axis. Bottom: Our RoboEnvision model breaks down a high-level instruction into small atomic instructions with a VLM, generates a frame aligned with each one, and interpolates between them. A policy model estimates the robot joints based on the keyframes and a few interpolated frames in between.
  • Figure 2: RoboEnvision generates keyframes aligned with short-horizon instructions (Stage 1) and interpolates between them (Stage 2). We show: (A) the architecture of Stage 1, the keyframe diffusion model, (B) the mask used in Keyframe-Instruction Cross-Attention, (C) the design of the Semantics Preserving Attention module to enforce consistency, and (D) the Policy Model that regresses the robot joint angles from the generated frames.
  • Figure 3: Qualitative results comparing our method with baselines on the LanguageTable and LHMM datasets.
  • Figure 4: Visualization of long-horizon video generation based on instructions with different execution orders.
  • Figure 5: Qualitative results of long-horizon planning using GPT4-o1.
  • ...and 1 more figures
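
The Figure 2 caption mentions a mask used in Keyframe-Instruction Cross-Attention but does not specify it here. One plausible reading, shown below purely as an assumption, is a block-structured mask in which the tokens of keyframe i may attend only to the text tokens of atomic instruction i. The function names, token counts, and the NumPy attention routine are all hypothetical and chosen for illustration.

```python
import numpy as np

def keyframe_instruction_mask(tokens_per_frame, instr_lengths):
    """Boolean mask of shape (num_frame_tokens, num_text_tokens).

    Assumption: frame i is paired with instruction i, and True means
    "this frame token may attend to this text token".
    """
    n = len(instr_lengths)
    query_owner = np.repeat(np.arange(n), tokens_per_frame)  # frame index per query
    key_owner = np.repeat(np.arange(n), instr_lengths)       # instruction index per key
    return query_owner[:, None] == key_owner[None, :]

def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product cross-attention with disallowed pairs set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Two keyframes with 2 visual tokens each; instructions of 3 and 1 text tokens.
mask = keyframe_instruction_mask(tokens_per_frame=2, instr_lengths=[3, 1])
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = masked_cross_attention(q, k, v, mask)
```

Because the second instruction has a single text token, both tokens of the second keyframe receive exactly that token's value vector, which makes the effect of the mask easy to check.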