Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang

Abstract

Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
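
To make the notion of kinematically derived structured frames concrete, here is a minimal sketch of how candidate frames might be picked from a recorded trajectory using the two cues named in the abstract: gripper transitions and kinematic turning points. It is an illustration under stated assumptions, not the authors' extraction procedure. The function name `select_structured_frames`, the angle threshold, and the stationary-step cutoff are all hypothetical choices.

```python
import math


def select_structured_frames(positions, gripper_open,
                             angle_threshold_deg=60.0, min_step=1e-6):
    """Return sorted timestep indices of candidate structured frames.

    positions:    list of (x, y, z) end-effector positions, one per timestep.
    gripper_open: list of bools (True = open), one per timestep.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if len(positions) != len(gripper_open):
        raise ValueError("positions and gripper_open must have equal length")

    keyframes = set()

    # Cue 1: gripper transitions (open -> closed or closed -> open).
    for t in range(1, len(gripper_open)):
        if gripper_open[t] != gripper_open[t - 1]:
            keyframes.add(t)

    # Cue 2: kinematic turning points, taken here to be timesteps where the
    # direction of motion changes by more than the angle threshold.
    cos_threshold = math.cos(math.radians(angle_threshold_deg))
    for t in range(1, len(positions) - 1):
        v_in = [b - a for a, b in zip(positions[t - 1], positions[t])]
        v_out = [b - a for a, b in zip(positions[t], positions[t + 1])]
        n_in = math.sqrt(sum(c * c for c in v_in))
        n_out = math.sqrt(sum(c * c for c in v_out))
        if n_in < min_step or n_out < min_step:
            continue  # skip near-stationary steps, where direction is undefined
        cos_angle = sum(a * b for a, b in zip(v_in, v_out)) / (n_in * n_out)
        if cos_angle < cos_threshold:
            keyframes.add(t)

    return sorted(keyframes)


# Toy trajectory: move along x, turn upward along z, closing the gripper.
toy_positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, 1), (2, 0, 2)]
toy_gripper = [True, True, True, False, False]
toy_keyframes = select_structured_frames(toy_positions, toy_gripper)
```

On the toy trajectory, the 90-degree turn at index 2 and the gripper closing at index 3 are both selected, while the straight-line steps are skipped. This is the sense in which such frames are sparse compared with a dense rollout.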

Paper Structure (23 sections, 4 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Illustration of StructVLA (Structured Planner for Vision-Language-Action). Existing generative VLA methods either predict dense future observations (bottom left), which accumulate errors and cause long-horizon plan drift, or rely on semantic or latent planning (bottom right) that lacks explicit geometric grounding and leads to physical misalignment. Our StructVLA (top) learns a physically grounded structured planner by training a world model to predict sparse structured frames, and transfers this representation to action control. This design enables stable long-horizon manipulation and demonstrates strong generalization and robustness across simulation and real-world benchmarks.
  • Figure 2: Overview of StructVLA. StructVLA is trained in two stages. Stage 1 (Structured Planner): an autoregressive world model predicts sparse structured frames that capture physically grounded progress anchors, conditioned on the instruction and visual context. Stage 2 (Action Policy): we fine-tune the structured planner for control by conditioning on the instruction together with interleaved observation and actions, transferring structured planning into low-level control.
  • Figure 3: Qualitative comparison of visual predictions. Our structured planner generates coherent, long-horizon visual foresight, supporting robust planning, whereas the baseline world model struggles with long-range comprehension, producing short-horizon predictions with degraded image quality. Red boxes highlight key differences.
  • Figure 4: More visualizations. (a) Attention maps: the planner (left) localizes task-critical interactions (e.g., gripper-object contact), while the baseline's attention (right) is diffuse. (b) OOD predictions: StructVLA retains world-model priors, enabling zero-shot visual planning on unseen scenes and tasks. (c) Real-world deployment setup.
  • Figure 5: Quantitative Results on Real-World Deployments. Success rates on diverse physical manipulation tasks (10 trials per task). StructVLA matches or exceeds prior baselines on the foundation tasks and remains strong on the challenging long-horizon tidy-up task.
  • ...and 5 more figures
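
The two-stage layout described in the Figure 2 caption, combined with the unified discrete token vocabulary mentioned in the abstract, can be sketched as follows. This is a hedged illustration only: the source does not specify how text, image, and action tokens are laid out, so the contiguous-offset scheme, the class `UnifiedVocab`, and the helpers `stage1_example` and `stage2_sequence` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedVocab:
    """Hypothetical shared vocabulary: text ids, then image ids, then action ids."""
    n_text: int
    n_image: int
    n_action: int

    @staticmethod
    def _check(i, n):
        if not 0 <= i < n:
            raise ValueError(f"token id {i} out of range [0, {n})")
        return i

    def text(self, i):
        return self._check(i, self.n_text)

    def image(self, i):
        return self.n_text + self._check(i, self.n_image)

    def action(self, i):
        return self.n_text + self.n_image + self._check(i, self.n_action)

    @property
    def size(self):
        return self.n_text + self.n_image + self.n_action


def stage1_example(vocab, instruction, observation, structured_frame):
    """Stage 1 (planner): instruction + visual context -> structured-frame tokens."""
    context = [vocab.text(i) for i in instruction]
    context += [vocab.image(i) for i in observation]
    target = [vocab.image(i) for i in structured_frame]
    return context, target


def stage2_sequence(vocab, instruction, steps):
    """Stage 2 (policy): instruction followed by interleaved observation/action tokens."""
    seq = [vocab.text(i) for i in instruction]
    for obs, act in steps:
        seq += [vocab.image(i) for i in obs]
        seq += [vocab.action(i) for i in act]
    return seq


vocab = UnifiedVocab(n_text=10, n_image=20, n_action=5)
ctx, tgt = stage1_example(vocab, [1, 2], [0, 3], [4])
seq = stage2_sequence(vocab, [1], [([0], [2]), ([5], [4])])
```

Because every modality maps into one id space, the same autoregressive model can be trained on structured-frame prediction first and then fine-tuned on sequences that also contain action tokens, which is the transfer the caption describes.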