Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

Zidan Wang; Rui Shen; Bradly Stadie

TL;DR

Wonderful Team presents a zero-shot, multi-agent Vision-Language Model framework for high-level robotic planning that unifies perception, grounding, and planning within a single VLLM-based loop. By distributing reasoning across specialized agents (Supervisor, Verification, Grounding Manager, Mover, Checker) and maintaining cross-agent memory, the approach enables self-correction and robust long-horizon planning in both simulated and real-world tasks. Empirical results on VIMABench and real-robot tasks show substantial gains over baselines and demonstrate the value of structured reflection and grounding for precise coordinate-level control. The work also contrasts zero-shot VLLM strategies with fine-tuned Vision-Language-Action models, arguing that a modular, hierarchical VLLM framework can achieve strong generalization with fewer task-specific adjustments, albeit with limitations in 3D reasoning and real-time adaptation.

Abstract

We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework for executing high-level robotic planning in a zero-shot regime. In our context, zero-shot high-level planning means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description, and the VLLM outputs the sequence of actions necessary for the robot to complete the task. Unlike previous methods for high-level visual planning for robotic manipulation, our method uses VLLMs for the entire planning process, enabling a more tightly integrated loop between perception, control, and planning. As a result, Wonderful Team's performance on real-world semantic and physical planning tasks often exceeds that of methods that rely on separate vision systems. For example, we see an average 40% success rate improvement on VIMABench over prior methods such as NLaP, an average 30% improvement over Trajectory Generators on tasks from the Trajectory Generator paper, including drawing and wiping a plate, and an average 70% improvement over Trajectory Generators on a new set of semantic reasoning tasks, including environment rearrangement with implicit linguistic constraints. We hope these results highlight the rapid improvement of VLLMs over the past year and motivate the community to consider VLLMs as an option for some high-level robotic planning problems in the future.
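The zero-shot interface described in the abstract (an image and a task description go in, an ordered action sequence comes out) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the `Action` type, the "pick"/"place" vocabulary, the coordinate values, and `stub_vllm_planner` are all hypothetical stand-ins, and a real system would replace the stub with a call to a VLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    """One step of a plan (hypothetical schema, not taken from the paper)."""
    name: str                        # e.g. "pick" or "place" (assumed vocabulary)
    target_xy: Tuple[float, float]   # assumed image-plane coordinates


# A planner maps (image bytes, task description) to an ordered action list.
Planner = Callable[[bytes, str], List[Action]]


def stub_vllm_planner(image: bytes, task: str) -> List[Action]:
    """Stand-in for a VLLM call; returns a fixed plan for illustration."""
    return [Action("pick", (120.0, 80.0)), Action("place", (300.0, 210.0))]


def plan_zero_shot(planner: Planner, image: bytes, task: str) -> List[Action]:
    """Query the planner once, with no task-specific training or examples."""
    plan = planner(image, task)
    if not plan:
        raise ValueError("planner returned an empty plan")
    return plan


plan = plan_zero_shot(stub_vllm_planner, b"", "put the apple on the plate")
```

Treating the planner as a plain callable keeps the sketch independent of any particular model or API; the multi-agent structure named in the TL;DR would sit behind that callable.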

Paper Structure (75 sections, 47 figures, 21 tables, 2 algorithms)

Introduction
Motivating Examples
Can an LLM Planner with a Separate Vision Model Find Objects?
Can These Issues Be Fixed Easily?
Could simply replacing LangSAM with a VLLM solve these issues?
Wonderful Team
Related Work
Experimental Results
Multimodal Reasoning - Simulated VIMABench
Implicit Goal Inference - Real Robots
Spatial Planning - Real Robots
Results and Discussion
Implicit Goal Inference Tasks
Spatial Planning Tasks
Further Discussions
...and 60 more sections

Figures (47)

Figure 1: Comparison of plans for a color-matching fruit placement task. MOKA's trajectory (b) shows limitations arising from both unverified plans (e.g., step 2 treating the grape as green) and misalignment between visual grounding and VLLM planning (e.g., step 3 misidentifying the orange fruit as the 'orange area' and step 4 failing to pick the green apple due to unspecified color). These errors are often irrecoverable, even with VLLM correction (see Figure \ref{fig:groundingdino_comparison}). See Appendix \ref{sec:appendix_MOKA} for details on the comparison with MOKA.
Figure 2: Examples of detection failures by specialized segmentation models. Segmentation errors are common across prior methods, regardless of how the model is prompted. See Figure \ref{fig:groundingdino_comparison} for a detailed discussion on the extent of this issue and Appendix \ref{sec:comparisons} for extensive comparisons with TG and MOKA.
Figure 3: An example of multiple VLLMs working together to recognize and correct an error in object positioning upon review.
Figure 4: A VLLM improving its estimation of the grapes' position over several iterations.
Figure 5: Overview of the major components of the Wonderful Team. Each part of the pipeline receives a different level of input, with a unique scope and specialization within the project. The agents collaboratively handle tasks ranging from high-level planning and logical verification to precise spatial reasoning and memory management, ensuring robust and efficient execution.
...and 42 more figures
