Scaling World Model for Hierarchical Manipulation Policies

Qian Long, Yueze Wang, Jiaxi Song, Junbo Zhang, Peiyan Li, Wenxuan Wang, Yuqi Wang, Haoyang Li, Shaoxuan Xie, Guocai Yao, Hanbo Zhang, Xinlong Wang, Zhongyuan Wang, Xuguang Lan, Huaping Liu, Xinghang Li

TL;DR

VISTA presents a scalable hierarchical framework that marries an embodied world model with a visual-goal, text-conditioned low-level policy to address the brittleness of Vision-Language-Action models in out-of-distribution robotic manipulation. The world model autonomously generates interleaved sequences of textual subtasks and multi-view visual subgoals, which ground high-level planning in physically plausible visual guidance; the GoalVLA then executes action chunks conditioned on both the textual and visual cues. With only 2 hours of real-robot data across 5 objects, VISTA achieves 69% success on 21 unseen objects and tasks, vastly improving over a language-only baseline at 14%, demonstrating notable data efficiency and cross-embodiment generalization. The approach leverages large-scale embodied datasets, automated milestone labeling, and a large Any-to-Image pretraining corpus to enable robust, multi-view, spatially grounded manipulation, setting a new direction for scalable, generalizable robot learning.

Abstract

Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To address this generalization bottleneck, we introduce VISTA (VIsual Subgoal TAsk decomposition), a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable task decomposition. VISTA consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides a manipulation task into a sequence of subtasks, each paired with goal images, and the low-level policy follows this textual and visual guidance to generate action sequences. Compared to raw textual goal specification, the synthesized goal images provide visually and physically grounded details for the low-level policy, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in extensive out-of-distribution scenarios; with the guidance generated by the world model, the performance of the same-structured VLA in novel scenarios improves from 14% to 69%. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/

Paper Structure (43 sections, 5 equations, 23 figures, 2 tables)

Figures (23)

  • Figure 1: Overview of VISTA. The framework comprises two essential modules. Left: VISTA serves as a high-level planner. By treating visual goals and textual subtasks as a unified multi-modal sequence, it autoregressively generates interleaved textual subtasks and visual goals conditioned on the global instruction and initial observation. Right: The GoalVLA acts as the low-level controller. It takes the real-time observation and the generated subgoal as input to predict executable action chunks. Execution is managed hierarchically: a subtask switcher transitions to the next stage once the current stage is complete.
  • Figure 2: Visualization of samples from our constructed embodied dataset. The dataset represents manipulation tasks as interleaved sequences of textual subtasks and their corresponding visual goals.
  • Figure 3: Visualization of real-world experiment setups, including the in-domain and novel scenarios.
  • Figure 4: Qualitative results for the subtask and goal image sequences generated by VISTA. See the appendix for more results.
  • Figure 5: The visualization of the multi-view goal images generated by VISTA in the real robot workspace with novel scenarios (unseen layout, distractor, target object, and background).
  • ...and 18 more figures
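
The control flow described in the Figure 1 caption (a planner that emits subtasks with goal images, a goal-conditioned policy that emits action chunks, and a switcher that advances between stages) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Subgoal`, `run_hierarchical_episode`, `plan`, `policy`, `is_done`, `max_chunks_per_subgoal`) is hypothetical, the planner is called once up front rather than autoregressively, and the toy demo replaces images and robot actions with integers.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Subgoal:
    """One planner output: a textual subtask plus one goal image per camera view."""
    text: str
    goal_images: List[Any]  # stand-ins for multi-view goal images


def run_hierarchical_episode(
    instruction: str,
    get_observation: Callable[[], Any],
    plan: Callable[[str, Any], List[Subgoal]],      # stand-in for the world model
    policy: Callable[[Any, Subgoal], List[Any]],    # stand-in for the GoalVLA
    is_done: Callable[[Any, Subgoal], bool],        # stand-in for the subtask switcher
    apply_action: Callable[[Any], None],
    max_chunks_per_subgoal: int = 10,               # assumed safety limit
) -> List[str]:
    """Execute subgoals in order and return the texts of the completed ones."""
    subgoals = plan(instruction, get_observation())
    completed: List[str] = []
    for subgoal in subgoals:
        reached = is_done(get_observation(), subgoal)
        chunks = 0
        while not reached and chunks < max_chunks_per_subgoal:
            # The low-level policy sees the current observation and the subgoal
            # (text + goal images) and returns a chunk of actions.
            for action in policy(get_observation(), subgoal):
                apply_action(action)
            chunks += 1
            reached = is_done(get_observation(), subgoal)
        if not reached:
            break  # give up on the episode if a stage cannot be completed
        completed.append(subgoal.text)
    return completed


def demo():
    """Toy 1-D world: the 'goal image' is just the target position."""
    state = {"pos": 0}

    def get_observation():
        return dict(state)

    def plan(instruction, obs):
        return [
            Subgoal("reach the object", [2, 2]),
            Subgoal("move to the target", [5, 5]),
        ]

    def policy(obs, subgoal):
        # Action chunk: at most two unit steps toward the goal position.
        remaining = subgoal.goal_images[0] - obs["pos"]
        return [1] * min(2, remaining)

    def is_done(obs, subgoal):
        return obs["pos"] == subgoal.goal_images[0]

    def apply_action(action):
        state["pos"] += action

    completed = run_hierarchical_episode(
        "put the object on the target",
        get_observation, plan, policy, is_done, apply_action,
    )
    return completed, state["pos"]
```

Calling `demo()` runs both toy stages to completion. In a real system the three callables would wrap the world model, the GoalVLA, and whatever completion check the switcher uses; none of those interfaces are specified in the text above.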