DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin; Yankai Lin; Wang Xu; Sikyuen Tam; Xiangrui Zeng; Zhiyuan Liu; Zhouping Yin

TL;DR

The paper tackles the data-hungry nature of vision-language-action (VLA) models with a think-before-acting paradigm that separates chain-of-thought (CoT) reasoning from action generation. It proposes DeepThinkVLA, a hybrid-attention decoder that uses causal attention for sequential CoT and bidirectional attention for parallel action decoding, paired with a two-stage training pipeline of supervised fine-tuning and outcome-based reinforcement learning. The approach achieves a state-of-the-art 97.0% average success rate on LIBERO, with gains across object, spatial, goal, and long-horizon tasks. Ablations attribute the improvement to both design choices: the hybrid architecture alone outperforms standard decoders by 15.5%, and the RL stage adds a further 2%. The work argues that co-designing architecture and training to align chain-of-thought with action improves reliability and performance in embodied AI.

Abstract

Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
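
The hybrid-attention idea described above can be illustrated with a small attention mask. The following is a minimal sketch, not the authors' implementation: it assumes a sequence laid out as reasoning (CoT) tokens followed by action tokens, and the function name `hybrid_attention_mask` and its parameters are hypothetical. Reasoning tokens see only earlier positions (causal), while action tokens see all reasoning tokens and each other in both directions, which is what would allow them to be decoded in parallel.

```python
def hybrid_attention_mask(num_reasoning, num_action):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed layout (hypothetical, for illustration only):
    positions [0, num_reasoning) are CoT tokens, the rest are action tokens.
    """
    total = num_reasoning + num_action
    # Start from a standard causal mask: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(total)] for i in range(total)]
    # Open up the action block so action tokens attend to each other bidirectionally.
    for i in range(num_reasoning, total):
        for j in range(num_reasoning, total):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    # 3 reasoning tokens followed by 2 action tokens.
    for row in hybrid_attention_mask(3, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

In this sketch the reasoning rows stay lower-triangular, so the CoT remains autoregressive, while the action rows are fully populated. How the actual model handles visual and instruction prefix tokens is not specified in the abstract and is left out here.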
