RoboDreamer: Learning Compositional World Models for Robot Imagination

Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, Chuang Gan

TL;DR

RoboDreamer addresses generalization gaps in text-to-video models for robotics by factorizing generation into language-driven primitives and multimodal cues. It introduces a text parser to extract action verbs and spatial relations and trains a set of conditioned diffusion models that can be recombined to handle unseen instructions. The approach extends to multimodal goals, enabling goal images and sketches to refine generated plans. Experiments on RT-1 and RLBench demonstrate improved zero-shot generalization, better spatial accuracy, and practical robot planning capabilities compared to baselines.

Abstract

Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization: models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, and condition a set of models on these primitives to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.
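The idea of conditioning a set of models on parsed primitives and recombining them can be illustrated with a small sketch. The code below assumes a classifier-free-guidance-style composition, in which each component's conditional noise prediction is compared against an unconditional prediction and the differences are summed. The function name `compose_noise`, the `weight` parameter, and the use of plain Python lists in place of video tensors are all illustrative assumptions, not details taken from the paper.

```python
def compose_noise(eps_uncond, eps_conds, weight=1.0):
    """Combine per-component noise predictions into one prediction.

    eps_uncond: unconditional noise prediction (list of floats).
    eps_conds:  one conditional prediction per parsed component
                (e.g. one for the action phrase, one per relation phrase).
    weight:     assumed guidance strength shared by all components.
    """
    composed = []
    for i, u in enumerate(eps_uncond):
        # Sum how far each component pushes away from the unconditional estimate.
        delta = sum(cond[i] - u for cond in eps_conds)
        composed.append(u + weight * delta)
    return composed


# Toy usage: two components, each a 2-element "noise" vector.
eps_uncond = [0.0, 1.0]
eps_action = [1.0, 1.0]     # hypothetical prediction given the action phrase
eps_relation = [0.5, 2.0]   # hypothetical prediction given a relation phrase
print(compose_noise(eps_uncond, [eps_action, eps_relation], weight=2.0))
```

With a single component and `weight=1.0`, the composed prediction reduces to that component's conditional prediction, which is a quick sanity check on the formula. A new instruction whose components were each seen during training can then be handled by passing their predictions together, even if that exact combination never appeared.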

Paper Structure (20 sections, 6 equations, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: Compositional Action Specification. When existing text-to-video models (AVDC; Ko et al., 2023) are given unusual combinations of language instructions, they are unable to synthesize videos that align accurately with these descriptions. RoboDreamer factorizes the generation compositionally, enabling generalization to novel combinations of language.
  • Figure 2: Compositional World Models. Given language instructions and multimodal instructions such as goal images and sketches, our approach factorizes the generation into a composition of diffusion models conditioned on inferred components. This enables our approach to generalize to both new combinations of language and multimodal input.
  • Figure 3: Overall framework of RoboDreamer. On the left, we leverage the natural compositionality of language to parse instructions into components such as action phrases and relation phrases. On the right, we show how RoboDreamer composes multiple components.
  • Figure 4: Zero-Shot Video Generation. Given novel combinations of natural language, RoboDreamer synthesizes videos substantially more accurately than a single monolithic text-to-video model.
  • Figure 5: Multimodal Compositionality. RoboDreamer is able to compose multimodal inputs such as goal and sketch image conditioning with language instructions and synthesize plausible videos.
  • ...and 2 more figures
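
The parsing step mentioned in the Figure 3 caption, which splits an instruction into an action phrase and relation phrases, can be illustrated with a toy example. The sketch below is a deliberately simple rule-based stand-in: it splits on a small, assumed vocabulary of spatial prepositions. The function name `parse_instruction` and the preposition list are hypothetical, and the parser actually used by RoboDreamer is not described in this section.

```python
import re

# Assumed, tiny vocabulary of spatial prepositions; longer phrases come first
# so that "on top of" is matched before "on" and "into" before "in".
_RELATION_START = re.compile(r"\b(on top of|next to|into|near|from|in|on|to)\b")


def parse_instruction(text):
    """Split an instruction into (action_phrase, [relation_phrases])."""
    matches = list(_RELATION_START.finditer(text))
    if not matches:
        # No spatial relation found: the whole instruction is the action phrase.
        return text.strip(), []
    action = text[:matches[0].start()].strip()
    relations = []
    for i, m in enumerate(matches):
        # Each relation phrase runs from its preposition to the next one (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        relations.append(text[m.start():end].strip())
    return action, relations


print(parse_instruction("place the cup on top of the drawer"))
print(parse_instruction("put the apple in the bowl next to the can"))
```

Each returned phrase would then serve as the conditioning input for one component model. A rule-based split like this breaks easily on real instructions, so it is meant only to make the action/relation decomposition concrete.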