ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu

TL;DR

ChatVLA-2 addresses the erosion of pretrained knowledge in vision-language-action models by combining a dynamic mixture-of-experts backbone with a reasoning-following module and a two-stage training scheme to retain open-world reasoning while enabling actionable robot control. It demonstrates robust open-world reasoning on math-matching and spatial-placement tasks, including OCR and math reasoning, outperforming OpenVLA, DexVLA, and pi-0 baselines in generalization. The approach advances toward truly generalizable robotic foundation models capable of leveraging pretrained knowledge for robust reasoning and control in novel environments.

Abstract

Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning: the VLA should inherit the knowledge from the VLM, i.e., recognize anything that the VLM can recognize, be capable of solving math problems, and possess visual-spatial intelligence; and 2) Reasoning following: effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model coupled with a specialized two-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-0. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.

Paper Structure

This paper contains 18 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Our proposed ChatVLA-2 model enables generalized open-world embodied reasoning and reasoning-following abilities. We designed two tasks, a math-matching game and a toy-placement experiment, to demonstrate its generalization ability.
  • Figure 2: Model architecture. Left: A reasoning-following enhancement module is incorporated to ensure that the VLA model adheres to logical reasoning when performing actions. Right: Our method leverages a dynamic mixture-of-experts architecture to disentangle conflicting features between multimodal understanding and robotic control, while effectively integrating mutually beneficial features.
  • Figure 3: Training strategy. We use a two-stage training strategy. In the first stage, we co-train on image-text data and robot data to give the VLA open-world reasoning capabilities. In the second stage, we freeze the entire VLM and train only the action expert, thereby preserving open-world reasoning while enhancing the VLA's instruction-following abilities.
  • Figure 4: Experimental setup for the math-matching game and toy placement. We use a Franka Emika robot equipped with a Robotiq gripper to pick and place items at specified target locations. We utilize ARX R5 bimanual robots with a top-mounted RealSense L515 camera. Our experiments demonstrate that the proposed method successfully completes tasks involving previously unseen spatial instructions and novel objects.
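
The two-stage schedule described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names (`Module`, `VLAModel`), the component names (`vlm_backbone`, `action_expert`), and the function `configure_stage` are hypothetical, and the use of robot data alone in stage 2 is an assumption, since the caption does not state the stage-2 data mix. The sketch only shows which components would be trainable in each stage.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """Stand-in for a group of model parameters (hypothetical)."""
    name: str
    trainable: bool = True


@dataclass
class VLAModel:
    """Toy VLA with two parameter groups; the names are illustrative only."""
    vlm_backbone: Module = field(default_factory=lambda: Module("vlm_backbone"))
    action_expert: Module = field(default_factory=lambda: Module("action_expert"))

    def modules(self) -> List[Module]:
        return [self.vlm_backbone, self.action_expert]


def configure_stage(model: VLAModel, stage: int) -> List[str]:
    """Set trainable flags for a stage and return the data sources to use."""
    if stage == 1:
        # Stage 1: co-train everything on image-text data and robot data.
        for module in model.modules():
            module.trainable = True
        return ["image_text", "robot"]
    if stage == 2:
        # Stage 2: freeze the entire VLM and train only the action expert.
        model.vlm_backbone.trainable = False
        model.action_expert.trainable = True
        # Assumption: robot data only; the caption does not specify this.
        return ["robot"]
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    model = VLAModel()
    for stage in (1, 2):
        data = configure_stage(model, stage)
        flags = {m.name: m.trainable for m in model.modules()}
        print(f"stage {stage}: data={data}, trainable={flags}")
```

In a real training framework the `trainable` flag would correspond to disabling gradient updates for the frozen parameters, so that stage 2 cannot overwrite the reasoning ability acquired by the VLM in stage 1.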