GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang

TL;DR

GeneralVLA introduces a hierarchical vision-language-action framework that decouples perception, planning, and control to achieve robust zero-shot robotic manipulation. By integrating an Affordance Segmentation Module (ASM), Knowledge-Guided Trajectory Planning via a 3DAgent, and a collision-aware grasping module (HGM), the approach leverages foundation-model priors to generate diverse, high-quality robotic data without real-world demonstrations. Results in simulation and real-world settings show strong zero-shot performance across multiple tasks, and behavior cloning policies trained on GeneralVLA-generated data also benefit. The findings suggest that hierarchical decomposition and knowledge grounding can substantially scale zero-shot robotics, enabling long-horizon planning and robust manipulation with reduced data requirements.

Abstract

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that makes more effective use of the generalization of foundation models, enabling zero-shot manipulation and automatic data generation for robotics. In particular, we study a class of hierarchical VLA models in which the high-level ASM (Affordance Segmentation Module) is fine-tuned to perceive image keypoint affordances of the scene, and the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction then serves as guidance for the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than training with human demonstrations or with data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.
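
The data flow described in the abstract (2D keypoint affordance, then a 3D end-effector path, then low-level control) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Intrinsics`, `backproject`, and `plan_path` are hypothetical, the pinhole back-projection is a standard technique assumed here for lifting a keypoint to 3D, and the straight-line interpolation is only a stand-in for the 3DAgent's knowledge-guided planner.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(pixel: Point2D, depth_m: float, k: Intrinsics) -> Point3D:
    """Lift a 2D keypoint with known depth to a 3D point in the camera frame.

    Assumption: a standard pinhole model; the paper's actual 2D-to-3D
    lifting step may differ.
    """
    u, v = pixel
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def plan_path(start: Point3D, goal: Point3D, steps: int) -> List[Point3D]:
    """Stand-in mid-level planner: linear interpolation from start to goal.

    Returns steps + 1 waypoints, including both endpoints. The real 3DAgent
    reasons about the task and scene; this only shows the output format
    (an ordered list of 3D end-effector waypoints).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
        for i in range(steps + 1)
    ]


if __name__ == "__main__":
    cam = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    # Hypothetical keypoint that a high-level affordance module might return.
    grasp_point = backproject((380.0, 300.0), depth_m=0.8, k=cam)
    # Waypoints that a low-level 3D-aware policy would receive as guidance.
    for waypoint in plan_path((0.0, 0.0, 0.5), grasp_point, steps=4):
        print(waypoint)
```

Keeping the interface between levels as plain 3D waypoints is what lets each stage be swapped independently in a sketch like this; the low-level policy only needs the path, not the model that produced it.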

Paper Structure (38 sections, 1 equation, 11 figures, 7 tables)

This paper contains 38 sections, 1 equation, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of GeneralVLA, VLAs and earlier imitation learning methods. GeneralVLA’s hierarchical design results in better generalization. It enables a 3D trajectory planning framework that fully exploits the prior knowledge of foundation models.
  • Figure 2: Inference workflow of GeneralVLA. (a) The high-level ASM is called to generate the 2D points and corresponding semantic information. (b) The mid-level Knowledge-Guided Trajectory Planning carries out task understanding, 3D reasoning and planning to produce a 3D path indicating the desired robot end-effector trajectory. (c) The intermediate 3D path prediction then serves as guidance for the low-level, 3D-aware control policy, enhanced by HGM for precise manipulation.
  • Figure 3: Detailed framework of ASM and 3DAgent. (a) Given the input image and task text as query, the multimodal LLM (e.g., LLaVA) generates text output. The last-layer embedding for the <SEG> token is then decoded into the segmentation mask via the decoder. We use LoRA for efficient fine-tuning. The choice of vision backbone can be flexible (e.g., SAM3).
  • Figure 4: Example GeneralVLA rollouts demonstrate its strong performance in multi-object, multi-stage scenes, achieved by leveraging ASM’s segmentation capability, 3DAgent’s spatial reasoning ability, and the robust execution of the low-level 3D policy.
  • Figure 5: GeneralVLA is an open-vocabulary robot demonstration generation system. We show zero-shot demonstrations for 4 tasks in the real world.
  • ...and 6 more figures