Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model

Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li, Jing Xu, Jiayuan Gu, Zecui Zeng, Rui Chen

TL;DR

The paper addresses long-horizon manipulation planning under physical constraints by embedding Vision-Language Model (VLM) reasoning within a physics-based embodied world model. It introduces Embodied Tree of Thoughts (EToT), which uses Priori Branching to generate diverse plan branches and Reflective Branching to diagnose and revise failures through simulator feedback, all within a closed-loop system. Across seven tabletop tasks, EToT outperforms baselines, and ablations confirm the necessity of physics grounding, the planning-tree structure, and replanning under disturbances. The work reports improvements in physical consistency, failure resilience, and planning efficiency, which matter for robust robotic manipulation in open-world settings.

Abstract

World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to stay consistent with physical constraints over long horizons. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website: https://embodied-tree-of-thoughts.github.io
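The tree search described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Node`, `search`, `toy_priori`, `toy_simulate`, and `toy_reflect` are hypothetical, the VLM is replaced by hand-written rules, and the physics simulator is replaced by a two-variable toy state (a holder blocked by a cup). The sketch only shows the control flow: Priori Branching seeds the tree, each branch is rolled out in the world model, and Reflective Branching turns a diagnosed failure into revised child branches.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Plan = Tuple[str, ...]


@dataclass
class Node:
    plan: Plan
    parent: Optional["Node"] = None
    failure: Optional[str] = None  # diagnosis from the (stubbed) world model
    children: List["Node"] = field(default_factory=list)


def search(priori, simulate, reflect, max_nodes=20):
    """Breadth-first search over plans; returns the first feasible node or None."""
    root = Node(plan=())
    frontier = deque()
    # Priori Branching: seed the tree with diverse candidate plans.
    for plan in priori():
        child = Node(plan=plan, parent=root)
        root.children.append(child)
        frontier.append(child)
    seen = {node.plan for node in frontier}
    expanded = 0
    while frontier and expanded < max_nodes:
        node = frontier.popleft()
        expanded += 1
        ok, failure = simulate(node.plan)  # roll the plan out in the world model
        if ok:
            return node
        node.failure = failure
        # Reflective Branching: revise the failed plan using the diagnosis.
        for revised in reflect(node.plan, failure):
            if revised not in seen:
                seen.add(revised)
                child = Node(plan=revised, parent=node)
                node.children.append(child)
                frontier.append(child)
    return None


# --- Toy stand-ins (assumptions, not part of the paper) ---

def toy_priori():
    return [("insert pen",), ("remove cup",)]


def toy_simulate(plan):
    blocked, pen_in = True, False  # the holder starts blocked by a cup
    for step, action in enumerate(plan):
        if action == "remove cup":
            blocked = False
        elif action == "insert pen":
            if blocked:
                return False, f"step {step}: holder blocked by cup"
            pen_in = True
        else:
            return False, f"step {step}: unknown action {action!r}"
    return (True, None) if pen_in else (False, "goal not reached")


def toy_reflect(plan, failure):
    if "blocked" in failure:
        # Insert a corrective action just before the failing step.
        step = int(failure.split(":")[0].split()[1])
        return [plan[:step] + ("remove cup",) + plan[step:]]
    if failure == "goal not reached":
        return [plan + ("insert pen",)]
    return []


result = search(toy_priori, toy_simulate, toy_reflect)
print(result.plan)  # ('remove cup', 'insert pen')
```

In this toy run the feasible plan is a child of the failed `("insert pen",)` branch, so the tree records which diagnosis led to which revision. Breadth-first order and the `max_nodes` budget are arbitrary choices made for the sketch.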

Paper Structure

This paper contains 29 sections, 1 equation, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the Embodied Tree of Thoughts (EToT) framework. Given a task instruction, the system first reconstructs the real scene into an interactive 3D digital twin (Sec. \ref{sec:Real2Sim}). It then constructs a world-model-grounded planning tree through Priori Branching and Reflective Branching (Sec. \ref{sec:plan-tree}). Priori Branching proposes initial candidate branches, while Reflective Branching analyzes simulated execution failures to expand the tree with revised branches. Through iterative searching and expansion of the planning tree, the system identifies a feasible plan, which is finally executed on the real robot in a closed-loop manner with visual feedback and re-planning (Sec. \ref{sec:closed-loop}).
  • Figure 2: Experimental scenarios in real world and simulation.
  • Figure 3: Schematic diagrams of all tasks, including the initial states and the correct goal states, with boxes and arrows indicating the position changes of key objects. We provide detailed plan trees in the appendix (Fig. \ref{fig:tasks1-4}, \ref{fig:tasksd5-7}).
  • Figure 4: Comparison of scene evolution in the real world, the physics-based simulator, and the video generation model (VGM) for the action "Put the pen into holder 2" in Task 5.
  • Figure 5: Failure analysis. Representative examples and the percentage of each failure type are shown.
  • ...and 7 more figures