VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, Lin Shao

TL;DR

This work investigates how task planning representations and paradigms affect Vision-Language-Action (VLA) models, introducing VLA-OS as a unified, plug-and-play model family with three variants (ActionOnly, Integrated, Hierarchical). Using controlled experiments across rigid/deformable objects, 2D/3D modalities, and simulation/real-world tasks, it shows that visually grounded planning (visual reasoning and image foresight) generally outperforms language-based planning, and that Hierarchical-VLA achieves the best overall performance with favorable generalization and continual learning properties, albeit with higher training and inference costs. The study also demonstrates the benefits of planning-head pretraining and analyzes the distinct contributions of planning versus policy learning, providing practical guidance on data requirements and model scale. Overall, the results highlight actionable design choices for scalable, transferable VLA systems and offer a framework for systematic, fair comparisons across planning paradigms in robotics.

Abstract

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components that need further improvement. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to other paradigms in task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.

Paper Structure

This paper contains 41 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Left: four different VLA paradigms. Note that in this paper, we did not explore PlanningOnly-VLA models, since they usually cannot be trained with the provided datasets and perform worse than the others. Right: VLA paradigm comparison results. Hierarchical-VLA generally performs better than ActionOnly-VLA and Integrated-VLA, but it incurs larger training and inference costs. This motivates future work on improving training and inference algorithms for Hierarchical-VLA models.
  • Figure 2: The VLA-OS model family. Left: the VLM and the composable heads. Our VLMs share the same architecture but differ in the number of parameters. Although we only draw Qwen2.5 here, our code supports any kind of LLM backbone from HuggingFace. Right: four VLA-OS architectures used in our experiments. To minimize the effects of different numbers of parameters in different models, we restrict the number of parameters of all heads to about $5\%$ of the VLM.
  • Figure 3: The formats and contents of the language reasoning dataset, the visual reasoning dataset, and the image foresight reasoning dataset in this work. We use various vision-language models for data annotation. We illustrate the language reasoning data annotation process on the top left part.
  • Figure 4: Benchmarks used in our evaluations, including LIBERO and FurnitureBench for 2D rigid body manipulation experiments, The COLOSSEUM for 3D and generalization evaluation, real-world deformable object manipulation tasks (fold the handkerchief, unfold the jean, and straighten the rope), DexArt for dexterous tasks, and PerAct2 for dual-arm tasks.
  • Figure 5: Comparison between VLA-OS-I-E and VLA-OS-H with the same planning errors. The three planning representations shown in this figure all have small planning errors (highlighted).
  • ...and 4 more figures