Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation

Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, Lin Shao

TL;DR

Goal-VLA addresses robust zero-shot generalization in robotic manipulation by using an image-generative Vision-Language Model (VLM) as an object-centric world model to synthesize a goal state, from which a precise 3D pose is derived for low-level control. The framework decouples high-level semantic reasoning from spatial grounding via Goal State Reasoning, Spatial Grounding, and a training-free Low-level Policy, and introduces Reflection-through-Synthesis to iteratively validate and refine the goal image. Empirical results in RLBench simulation (eight tasks) and four real-world tasks show substantial gains over state-of-the-art end-to-end and hierarchical baselines, with significant improvements from input enhancement and the reflection loop. This work demonstrates effective zero-shot manipulation across diverse tasks, environments, object categories, and robot embodiments, highlighting a practical path for cross-embodiment generalization using foundation-model-driven world models. Overall, Goal-VLA provides a training-free bridge between semantic goal generation and precise spatial execution, enabling scalable deployment of robotic manipulation in unstructured settings.

Abstract

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training-free low-level control. To further improve robustness, we introduce a Reflection-through-Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real-world experiments demonstrate that Goal-VLA achieves strong performance and inspiring generalizability in manipulation tasks. Supplementary materials are available at https://nus-lins-lab.github.io/goalvlaweb/.

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Goal-VLA maps a single-view RGB-D image and a language instruction to executable manipulation actions. Our approach employs an object-centric world model to generate a goal image, from which the corresponding object transformation is subsequently computed, enabling zero-shot manipulation. This overcomes key limitations of previous methods, which often either rely on paired action data for training or lack precise spatial reasoning. We demonstrate our framework's strong performance and generalization capabilities across both simulated and real-world experiments.
  • Figure 2: Overview of the Goal-VLA framework, which decouples the manipulation pipeline into three stages: (a) Goal State Reasoning: A VLM generates a goal image from instructions and refines it for task feasibility, yielding a validated goal with image, mask, and depth. (b) Spatial Grounding: The object's transformation is computed by feature matching and point cloud registration between the initial and goal states. (c) Low-level Policy: The gripper's goal pose is derived by applying the object's transformation to a contact pose, after which a motion planner generates the final trajectory for robot execution.
  • Figure 3: An example of our Reflection-through-Synthesis process, which corrects a semantically correct but infeasible goal by refining the generation prompt.
  • Figure 4: Ablation Study. The performance of our full model ("World Model w/ Instruction & max 3 Reflection"), shown by the purple line, surpasses all ablated variants.
  • Figure 5: Qualitative Results from Real-World Experiments.
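
The Spatial Grounding and Low-level Policy stages described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that feature matching has already produced corresponding 3D points on the object in the initial and goal states, and it uses the closed-form Kabsch solution as a stand-in for the registration step; the paper's actual registration method is not specified in this section. The function names `estimate_rigid_transform` and `gripper_goal_pose` are hypothetical, as is the convention that all poses are 4x4 homogeneous matrices in one shared frame.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src points onto dst.

    src, dst: (N, 3) arrays of matched 3D points (N >= 3, not collinear).
    Returns a 4x4 homogeneous matrix T such that dst ~= R @ src + t.
    Assumption: correspondences come from an upstream feature matcher.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def gripper_goal_pose(T_object, T_contact):
    """Apply the object's transform to the gripper's contact (grasp) pose.

    Assumption: the gripper stays rigidly attached to the object after
    contact, and both matrices are expressed in the same fixed frame.
    """
    return T_object @ T_contact


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    initial_pts = rng.normal(size=(20, 3))

    # Synthetic goal state: rotate 90 degrees about z, then translate.
    R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t_true = np.array([0.1, -0.2, 0.3])
    goal_pts = initial_pts @ R_true.T + t_true

    T_obj = estimate_rigid_transform(initial_pts, goal_pts)

    T_contact = np.eye(4)
    T_contact[:3, 3] = [1.0, 0.0, 0.0]
    print(gripper_goal_pose(T_obj, T_contact))
```

In a full pipeline the resulting gripper goal pose would be handed to a motion planner, as the caption describes. With noisy or partially wrong correspondences, an outlier-robust wrapper (for example RANSAC around this solver, or an ICP refinement) would likely be needed.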