Scaling World Model for Hierarchical Manipulation Policies
Qian Long, Yueze Wang, Jiaxi Song, Junbo Zhang, Peiyan Li, Wenxuan Wang, Yuqi Wang, Haoyang Li, Shaoxuan Xie, Guocai Yao, Hanbo Zhang, Xinlong Wang, Zhongyuan Wang, Xuguang Lan, Huaping Liu, Xinghang Li
TL;DR
VISTA is a scalable hierarchical framework that pairs an embodied world model with a low-level policy conditioned on visual goals and text, addressing the brittleness of Vision-Language-Action models in out-of-distribution robotic manipulation. The world model autonomously generates interleaved sequences of textual subtasks and multi-view visual subgoals, which ground high-level planning in physically plausible visual guidance; the low-level policy, GoalVLA, then executes action chunks conditioned on both the textual and visual cues. With only 2 hours of real-robot data across 5 objects, VISTA achieves 69% success on 21 unseen objects and tasks, compared with 14% for a language-only baseline, demonstrating data efficiency and cross-embodiment generalization. The approach leverages large-scale embodied datasets, automated milestone labeling, and a large Any-to-Image pretraining corpus to enable robust, multi-view, spatially grounded manipulation.
Abstract
Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To address this generalization bottleneck, we introduce VISTA (VIsual Subgoal TAsk decomposition), a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable task decomposition. VISTA consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides a manipulation task into a sequence of subtasks, each with goal images, and the low-level policy follows this textual and visual guidance to generate action sequences. Compared to raw textual goal specification, the synthesized goal images provide visually and physically grounded details for the low-level policy, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in extensive out-of-distribution scenarios; with guidance generated by the world model, the performance of the same-structured VLA in novel scenarios improves from 14% to 69%. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/
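The planner/executor split described in the abstract can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: every name below (`Subgoal`, `run_episode`, `plan`, `policy`, `step`, `reached`, `max_chunks`) is hypothetical, the "images" are nested lists standing in for camera frames, and the explicit `reached` check is an assumption added only so the toy loop can terminate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]        # stand-in for a camera frame (assumption)
Observation = Dict[str, Image]   # camera name -> current frame
Action = List[float]


@dataclass
class Subgoal:
    text: str                      # textual subtask from the high-level planner
    goal_images: Dict[str, Image]  # one synthesized goal image per camera view


def run_episode(
    task: str,
    obs: Observation,
    plan: Callable[[str, Observation], List[Subgoal]],        # world-model stand-in
    policy: Callable[[Observation, Subgoal], List[Action]],   # low-level VLA stand-in
    step: Callable[[Action], Observation],                    # environment step
    reached: Callable[[Observation, Subgoal], bool],          # hypothetical success check
    max_chunks: int = 10,
) -> Tuple[List[str], Observation]:
    """Plan subgoals once, then execute action chunks until each subgoal is met."""
    completed: List[str] = []
    for subgoal in plan(task, obs):
        for _ in range(max_chunks):
            if reached(obs, subgoal):
                break
            # The policy sees both the text and the goal images of the subgoal.
            for action in policy(obs, subgoal):
                obs = step(action)
        if not reached(obs, subgoal):
            return completed, obs  # stop early: this subgoal was never achieved
        completed.append(subgoal.text)
    return completed, obs


class ToyEnv:
    """1-D toy world: the single 'pixel' of the front camera is the position."""

    def __init__(self) -> None:
        self.x = 0.0

    def observe(self) -> Observation:
        return {"front": [[self.x]]}

    def step(self, action: Action) -> Observation:
        self.x += action[0]
        return self.observe()


def toy_plan(task: str, obs: Observation) -> List[Subgoal]:
    return [
        Subgoal("move to 1", {"front": [[1.0]]}),
        Subgoal("move to 2", {"front": [[2.0]]}),
    ]


def toy_policy(obs: Observation, subgoal: Subgoal) -> List[Action]:
    gap = subgoal.goal_images["front"][0][0] - obs["front"][0][0]
    return [[gap / 2], [gap / 2]]  # a two-step action chunk toward the goal image


def toy_reached(obs: Observation, subgoal: Subgoal) -> bool:
    return abs(subgoal.goal_images["front"][0][0] - obs["front"][0][0]) < 1e-6


if __name__ == "__main__":
    env = ToyEnv()
    done, _ = run_episode("move to 2", env.observe(), toy_plan, toy_policy,
                          env.step, toy_reached)
    print(done)
```

In this sketch the planner is called once per task, matching the abstract's statement that the world model first divides the task into subtasks; a real system might instead replan as observations change, which the sketch does not attempt to model.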
