VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, Zheng Zhu

TL;DR

VLA-R1 addresses the lack of explicit, step-by-step reasoning in Vision-Language-Action models by combining chain-of-thought supervision with Reinforcement Learning from Verifiable Rewards (RLVR), optimized with Group Relative Policy Optimization (GRPO). It introduces VLA-CoT-13K, a chain-of-thought dataset built with a data engine that aligns reasoning traces with affordance and trajectory annotations. Evaluations span in-domain, out-of-domain (OOD), simulation, and real-robot settings, and report gains in affordance localization and trajectory accuracy, state-of-the-art performance in OOD scenarios, and practical viability on real robots. The work aims to narrow the gap between reasoning quality and action execution in embodied AI, and the authors plan to release the model, code, and dataset.

Abstract

Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
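The abstract names verifiable rewards for region alignment and trajectory consistency, optimized with GRPO, but does not give their formulas here. The following is a minimal sketch under stated assumptions: plain IoU stands in for the region reward, an inverse mean waypoint distance stands in for the trajectory reward, and the advantage is the usual GRPO-style standardization of rewards within one sampled group. The function names and the specific reward shapes are illustrative and are not taken from the paper.

```python
from math import hypot
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Assumption: used here as a stand-in for a region-alignment reward.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def trajectory_reward(pred, target):
    """Map the mean waypoint distance to (0, 1]; 1.0 means identical paths.

    Assumption: both paths have the same number of (x, y) waypoints;
    mismatched or empty paths score 0.0.
    """
    if not pred or len(pred) != len(target):
        return 0.0
    dist = mean(hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target))
    return 1.0 / (1.0 + dist)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score several sampled boxes for one prompt, then rank them within the group.
ground_truth = (0, 0, 2, 2)
samples = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
rewards = [iou(box, ground_truth) for box in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group mean, a sample is reinforced only when it scores better than its siblings for the same prompt, which is why the rewards need to be cheap and automatically checkable.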

Paper Structure

This paper contains 20 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: VLA-R1: pipeline from instruction to execution, with benchmark comparisons against baselines.
  • Figure 2: CoT Data Engine. After ingesting multimodal data, the system parses tasks based on type (e.g., affordance or trajectory), performs scene understanding and localization, validates feasibility, and generates structured CoT traces for training.
  • Figure 3: Overall architecture of VLA-R1. Training has two stages: Stage 1 uses SFT with CoT supervision to learn reasoning over images and instructions; Stage 2 refines reasoning and actions via reinforcement learning with verifiable rewards, optimized with GRPO. During inference, a control stack converts outputs into joint-level robot commands.
  • Figure 4: Case Analysis: The figure illustrates VLA-R1’s reasoning process and outcomes for both affordance and trajectory tasks. VLA-R1 parses the action requirements, infers relevant objects and spatial relations, and outputs the corresponding bounding boxes or waypoint sequences. The affordance form and trajectory form are fixed prompt templates that instruct the model to produce outputs in a specified format.
  • Figure 5: Visualization of evaluation in real-world scenarios.
  • ...and 9 more figures
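
The Figure 4 caption says that fixed prompt templates instruct the model to emit bounding boxes or waypoint sequences in a specified format, and the abstract lists output formatting as one of the verifiable rewards. The exact template is not given in this section, so the sketch below assumes a hypothetical `<think>...</think><answer>...</answer>` layout and a binary format reward; the tag names, the `parse_output` and `format_reward` helpers, and the 0/1 scoring are all illustrative assumptions.

```python
import re

# Hypothetical template: reasoning inside <think>, final numbers inside <answer>.
_TEMPLATE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_output(text, task):
    """Return a box tuple or a waypoint list, or None if the format is violated."""
    match = _TEMPLATE.match(text)
    if match is None:
        return None
    numbers = [float(n) for n in _NUMBER.findall(match.group(2))]
    if task == "affordance":
        # Expect exactly one (x1, y1, x2, y2) box.
        return tuple(numbers) if len(numbers) == 4 else None
    if task == "trajectory":
        # Expect a non-empty, even-length run of coordinates: x0, y0, x1, y1, ...
        if len(numbers) < 2 or len(numbers) % 2 != 0:
            return None
        return list(zip(numbers[0::2], numbers[1::2]))
    raise ValueError(f"unknown task: {task!r}")


def format_reward(text, task):
    """Binary reward: 1.0 when the output parses under the assumed template."""
    return 1.0 if parse_output(text, task) is not None else 0.0


# Usage: a well-formed affordance answer parses into a box; free text scores 0.
reply = "<think>The handle is on the left side.</think><answer>[10, 20, 110, 220]</answer>"
box = parse_output(reply, "affordance")
score = format_reward("The handle is at 10, 20.", "affordance")
```

Keeping the parser and the format reward as one code path means an output that earns the format reward is guaranteed to be consumable by the downstream region or trajectory reward.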