Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua

TL;DR

Reasoning-VLA introduces a fast, general Vision-Language-Action framework for autonomous driving. It decouples action generation from token-by-token decoding by using learnable action queries that interact with a reasoning-enhanced vision–language model. Actions are decoded in parallel, enabling real-time trajectory generation, while a refinement module improves precision. Generalization comes from a unified, Chain-of-Thought (CoT)-based training corpus built from eight datasets, with training by supervised fine-tuning and reinforcement learning using physics-informed rewards. Empirical results show state-of-the-art open- and closed-loop performance, strong generalization across diverse platforms, and substantial inference-speed advantages. The work provides a practical, scalable base model for autonomous driving tasks and highlights the value of unified reasoning datasets and parallel action decoding for VLA systems.

Abstract

Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle to achieve efficient inference and to generalize to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the best inference speed reported to date.
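
The Gaussian initialization of action queries described in the abstract can be illustrated with a minimal sketch. It assumes that each query is a raw sequence of (x, y) waypoints and that one independent Gaussian is fitted per waypoint and coordinate; the paper may instead operate in an embedding space. The function names `fit_waypoint_gaussians` and `sample_action_queries` and the toy trajectories are hypothetical and not taken from the paper.

```python
import random
import statistics

def fit_waypoint_gaussians(trajectories):
    """Fit a mean and standard deviation per waypoint and coordinate.

    trajectories: list of ground-truth trajectories, each a list of
    (x, y) waypoints with the same horizon (an assumption of this sketch).
    Returns one ((mean_x, std_x), (mean_y, std_y)) entry per waypoint.
    """
    horizon = len(trajectories[0])
    stats = []
    for t in range(horizon):
        xs = [traj[t][0] for traj in trajectories]
        ys = [traj[t][1] for traj in trajectories]
        stats.append(((statistics.fmean(xs), statistics.pstdev(xs)),
                      (statistics.fmean(ys), statistics.pstdev(ys))))
    return stats

def sample_action_queries(stats, num_queries, seed=0):
    """Draw initial query values from the fitted per-waypoint Gaussians."""
    rng = random.Random(seed)
    return [[(rng.gauss(mx, sx), rng.gauss(my, sy))
             for (mx, sx), (my, sy) in stats]
            for _ in range(num_queries)]

# Toy ground-truth trajectories (hypothetical values, 3 waypoints each).
gt = [
    [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)],
    [(0.0, 0.0), (1.2, -0.1), (2.4, -0.2)],
    [(0.0, 0.0), (0.8, 0.0), (1.6, 0.1)],
]
stats = fit_waypoint_gaussians(gt)
queries = sample_action_queries(stats, num_queries=4)
```

In a trainable model these sampled values would only be the starting point of the query parameters, which are then updated by gradient descent. Starting near plausible trajectories, rather than from arbitrary values, is one possible reading of why the initialization is drawn from ground-truth data.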

Paper Structure

This paper contains 40 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Reasoning-VLA is an efficient Vision–Language–Action (VLA) framework for autonomous driving that employs parallel actions to interact with reasoning-enhanced vision–language models (VLMs), enabling one-step prediction of future trajectories. The model is trained on our unified and generalized autonomous driving dataset using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL), guided by specifically designed rule-based reward functions.
  • Figure 2: The action module interacts with the vision-language model (VLM). The learnable action queries are initialized using a Gaussian distribution derived from the ground-truth (GT) action data. Through self-attention and cross-attention mechanisms with the reasoning VLM, the model transfers the generalized reasoning capability from the vision-language (VL) components to the action (A) module.
  • Figure 3: Statistical distribution of the unified dataset.
  • Figure 4: Qualitative Results of Action Trajectories. Reasoning-VLA predictions on eight different datasets. Red lines denote GT trajectories while green lines represent predicted trajectories.
  • Figure 5: Pipeline for generating the unified reasoning dataset.
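
The cross-attention step named in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's architecture: it uses a single head, no learned projections, no self-attention among the queries, and a hypothetical linear head `decode_waypoints` that maps each attended vector to an (x, y) waypoint. All names and numeric values are assumptions. The point it shows is that every action query attends to the VLM features independently, so all waypoints come out of one pass instead of one decoding step per token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, features):
    """Scaled dot-product cross-attention.

    queries: one d-dimensional vector per action query.
    features: d-dimensional VLM token vectors, used here as both
    keys and values (a simplification of this sketch).
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in features])
        outputs.append([sum(w * v[i] for w, v in zip(weights, features))
                        for i in range(d)])
    return outputs

def decode_waypoints(attended, head):
    """Map each attended vector to an (x, y) waypoint with a 2 x d linear head."""
    return [(dot(head[0], h), dot(head[1], h)) for h in attended]

# Hypothetical toy inputs: 3 VLM tokens, 4 action queries, feature size 4.
vlm_features = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
action_queries = [[4.0, 0.0, 0.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0],
                  [0.0, 0.0, 4.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]]
head = [[1.0, 2.0, 3.0, 0.0],
        [0.0, 0.5, 1.0, 0.0]]

attended = cross_attend(action_queries, vlm_features)
waypoints = decode_waypoints(attended, head)  # all 4 waypoints in one pass
```

Because no query depends on another query's output, the loop over queries could run as a single batched matrix operation, which is what makes this style of decoding parallel.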