Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs
Zidan Wang, Rui Shen, Bradly Stadie
TL;DR
Wonderful Team is a zero-shot, multi-agent Vision Large Language Model (VLLM) framework for high-level robotic planning that unifies perception, grounding, and planning within a single VLLM-based loop. By distributing reasoning across specialized agents (Supervisor, Verification, Grounding Manager, Mover, Checker) and maintaining cross-agent memory, the approach enables self-correction and robust long-horizon planning in both simulated and real-world tasks. Empirical results on VimaBench and real-robot tasks show substantial gains over baselines and demonstrate the value of structured reflection and grounding for precise coordinate-level control. The work also contrasts zero-shot VLLM strategies with fine-tuned Vision-Language-Action models, arguing that a modular, hierarchical VLLM framework can achieve strong generalization with fewer task-specific adjustments, albeit with limitations in 3D reasoning and real-time adaptation.
Abstract
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework for executing high-level robotic planning in a zero-shot regime. In our context, zero-shot high-level planning means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description, and the VLLM outputs the sequence of actions necessary for the robot to complete the task. Unlike previous methods for high-level visual planning for robotic manipulation, our method uses VLLMs for the entire planning process, enabling a more tightly integrated loop between perception, control, and planning. As a result, Wonderful Team's performance on real-world semantic and physical planning tasks often exceeds that of methods that rely on separate vision systems. For example, we see an average 40% success rate improvement on VimaBench over prior methods such as NLaP, an average 30% improvement over Trajectory Generators on tasks from the Trajectory Generator paper, including drawing and wiping a plate, and an average 70% improvement over Trajectory Generators on a new set of semantic reasoning tasks, including environment rearrangement with implicit linguistic constraints. We hope these results highlight the rapid improvement of VLLMs over the past year and motivate the community to consider VLLMs as an option for some high-level robotic planning problems in the future.
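As a rough illustration of the interface described above (an image plus a task description in, a sequence of actions out), the following minimal Python sketch passes a task through the agent roles named in the TL;DR while sharing a memory of earlier replies. It is not the authors' implementation. The `VLLM` callable, the `Action` type, the one-action-per-line text format, the role ordering, and the "ok" acceptance convention are all assumptions made for illustration, and `make_fake_vllm` is a stub standing in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a VLLM call: (role, prompt, image bytes) -> text reply.
VLLM = Callable[[str, str, bytes], str]

# Role names follow the TL;DR; the order in which they run is an assumption.
ROLES = ["supervisor", "verification", "grounding_manager", "mover", "checker"]


@dataclass
class Action:
    name: str  # e.g. "pick" or "place" (hypothetical action vocabulary)
    x: float   # target coordinates in the image frame (assumed convention)
    y: float


def parse_actions(text: str) -> List[Action]:
    """Parse lines of the assumed form '<name> <x> <y>' into Action objects."""
    actions = []
    for line in text.strip().splitlines():
        name, x, y = line.split()
        actions.append(Action(name, float(x), float(y)))
    return actions


def plan(image: bytes, task: str, vllm: VLLM, max_rounds: int = 3) -> List[Action]:
    """Run the roles in order, sharing memory, until the checker accepts a plan."""
    memory: List[str] = []  # cross-agent memory: every reply so far
    for _ in range(max_rounds):
        proposal, reply = "", ""
        for role in ROLES:
            prompt = f"Task: {task}\n" + "\n".join(memory)
            reply = vllm(role, prompt, image)
            memory.append(f"{role}: {reply}")
            if role == "mover":
                proposal = reply  # the mover's reply is the candidate plan
        # After the inner loop, `reply` holds the checker's verdict.
        if reply.strip().lower().startswith("ok"):
            return parse_actions(proposal)
    raise RuntimeError("checker rejected every proposed plan")


def make_fake_vllm() -> VLLM:
    """Stub model: the checker rejects the first plan and accepts the second."""
    calls = {"checker": 0}

    def fake(role: str, prompt: str, image: bytes) -> str:
        if role == "mover":
            return "pick 0.10 0.20\nplace 0.50 0.60"
        if role == "checker":
            calls["checker"] += 1
            return "ok" if calls["checker"] > 1 else "retry: target is off the object"
        return f"{role} notes"

    return fake


actions = plan(b"", "put the block in the bowl", make_fake_vllm())
```

Because each prompt includes the full memory, a rejection from the checker is visible to every role in the next round, which is one simple way to realize the self-correction the TL;DR mentions.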
