ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang

TL;DR

ManipDreamer tackles the dual challenge of instruction-following accuracy and high-fidelity visual synthesis in robotic video generation. It introduces a structured action-tree representation for instructions, enabling embeddings that capture inter-primitive and temporal dependencies, and couples this with a multi-modal visual guidance adapter that integrates depth and semantic cues into a diffusion-based world model. The method yields substantial improvements in video quality metrics on both seen and unseen tasks (higher PSNR and SSIM, lower LPIPS and Flow Error) and increases task success rates on RLBench. This work demonstrates that combining symbolic, hierarchical instruction representations with geometry- and semantics-aware visual conditioning can enhance both planning reliability and realism in robot-centric world models, with practical implications for sim-to-real transfer and scalable manipulation synthesis.
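
For readers unfamiliar with the video quality metrics named above, the following is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE), where higher values mean the generated frame is closer to the reference. The function name `psnr`, the flat-list frame format, and the 8-bit pixel range are assumptions made for illustration; this is not the paper's evaluation code.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """PSNR between two equally sized frames given as flat lists of pixel values.

    Assumes 8-bit pixels by default (max_value=255); this is an illustrative
    choice, not a detail taken from the paper.
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 10 grey levels on every pixel.
example = psnr([100, 100, 100, 100], [110, 110, 110, 110])
```

Because PSNR is logarithmic in the mean squared error, a gain of about 1.5 dB (as in 19.55 to 21.05) corresponds to roughly a 30% reduction in mean squared error.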

Abstract

While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on the action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as an action tree and assign embeddings to the tree nodes, so that each instruction acquires its embeddings by navigating through the action tree. The instruction embeddings are then used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics on both seen and unseen tasks: compared to the recent RoboDreamer model, on unseen tasks PSNR improves from 19.55 to 21.05, SSIM improves from 0.7474 to 0.7982, and Flow Error drops from 3.506 to 3.201. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% on average across 6 RLBench tasks.
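
The idea of assigning embeddings to tree nodes and collecting them by walking the tree can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class names `ActionTreeNode` and `ActionTree`, the verb-then-preposition path format, the embedding dimension, and the random vectors standing in for learned embeddings are all hypothetical.

```python
import random

class ActionTreeNode:
    """One node of a toy action tree; holds a label and a stand-in embedding."""

    def __init__(self, label, dim, rng):
        self.label = label
        # Random vector standing in for a learned embedding (assumption).
        self.embedding = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        self.children = {}

    def child(self, label, dim, rng):
        # Create the child on first use so instructions with a common prefix share nodes.
        if label not in self.children:
            self.children[label] = ActionTreeNode(label, dim, rng)
        return self.children[label]

class ActionTree:
    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.root = ActionTreeNode("<root>", dim, self.rng)

    def embed(self, path):
        """Walk root -> path[0] -> path[1] ... and return the embeddings met along the way."""
        node = self.root
        embeddings = []
        for label in path:
            node = node.child(label, self.dim, self.rng)
            embeddings.append(node.embedding)
        return embeddings

tree = ActionTree()
put_in = tree.embed(["put", "in"])  # e.g. "put the block in the drawer"
put_on = tree.embed(["put", "on"])  # e.g. "put the block on the shelf"
```

In this sketch, the two instructions share the embedding of the "put" node and differ only at the preposition node, which is one simple way a tree can encode relationships between primitives that independent per-primitive embeddings would miss.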

Paper Structure

This paper contains 29 sections, 6 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of ManipDreamer: a world model for robotic manipulation that integrates structured instruction representations (a) and multi-modal visual guidance (b). We encode language instructions as verb-preposition action trees to capture compositional task structure and inject depth and semantic features through a hierarchical adapter to enhance spatial-temporal consistency in video generation. In the UNet decoder, action tree embeddings and visual guidance features are sequentially injected every 3 layers using cross-attention mechanisms (c).
  • Figure 2: Unlike the instruction decomposition used in RoboDreamer, we propose a novel action tree method that represents the action to be generated while consuming fewer computational resources.
  • Figure 3: ManipDreamer alleviates multiple common defects in robotic video generation, including instruction misalignment, hallucinations, spatial errors, duplicate objects, temporal discontinuities, and failed executions. This figure presents sample comparisons between RoboDreamer and ManipDreamer, illustrating the effectiveness of our proposed approach.
  • Figure 4: Distribution of average modality weights across layers 1, 4, 7, and 10 in our multi-modal visual adapter. The results demonstrate the relative importance of each modality in different network layers.
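
To make the notion of per-layer modality weights concrete, here is a minimal sketch of one common way such weights can be formed: a softmax over per-modality logits, followed by a weighted sum of the modality features. The captions above only state that depth and semantic features are injected every 3 layers and that weights are reported for layers 1, 4, 7, and 10; the softmax fusion, the function names, and all numeric values below are assumptions for illustration, not the paper's adapter.

```python
import math

# Layers named in the Figure 4 caption; they are spaced every third layer.
INJECTION_LAYERS = [1, 4, 7, 10]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, logits):
    """Weighted sum of per-modality feature vectors, with weights = softmax(logits).

    `features` and `logits` are dicts keyed by modality name (hypothetical format).
    """
    names = sorted(features)
    weights = softmax([logits[name] for name in names])
    dim = len(features[names[0]])
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))

# Toy two-dimensional features and equal logits (hypothetical values).
features = {"depth": [1.0, 0.0], "semantic": [0.0, 1.0]}
layer_logits = {layer: {"depth": 0.0, "semantic": 0.0} for layer in INJECTION_LAYERS}
fused, weights = fuse(features, layer_logits[1])
```

With equal logits, each modality receives weight 0.5; in a trained model the logits would differ per layer, which is the kind of distribution Figure 4 reports.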