DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin

TL;DR

The paper tackles data-hungry vision–language–action models by introducing a think-before-acting paradigm that separates reasoning from action. It proposes DeepThinkVLA, a hybrid-attention decoder that uses causal attention for sequential CoT and bidirectional attention for parallel action decoding, paired with a two-stage training pipeline of supervised fine-tuning and outcome-based reinforcement learning. The approach yields state-of-the-art results on LIBERO, notably 97.0% average success and robust gains across object, spatial, goal, and long-horizon tasks, with ablations showing the architectural choice and RL stage as key contributors. This work demonstrates that co-designing architecture and training to align chain-of-thought with action substantially improves reliability and performance in embodied AI.

Abstract

Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
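The hybrid-attention idea described in the abstract can be illustrated with an attention mask. The following is a minimal sketch, not the authors' implementation: the function name `hybrid_attention_mask`, the three-segment token layout (prefix, CoT, action), and the choice to treat the observation/instruction prefix causally are all assumptions made for illustration.

```python
import numpy as np

def hybrid_attention_mask(n_prefix: int, n_cot: int, n_action: int) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed layout (hypothetical): [prefix tokens | CoT tokens | action tokens].
    """
    n = n_prefix + n_cot + n_action
    # Causal (lower-triangular) attention for the whole sequence, so that
    # CoT tokens are generated left to right and never see the future.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Action tokens additionally attend to every other action token,
    # which makes the action block bidirectional and decodable in parallel.
    action_start = n_prefix + n_cot
    mask[action_start:, action_start:] = True
    return mask

mask = hybrid_attention_mask(n_prefix=2, n_cot=3, n_action=4)
```

In this sketch, CoT positions still cannot attend to action positions, so the reasoning trace is produced before, and independently of, the action block that conditions on it.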

Paper Structure

This paper contains 34 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of VLA architectures. Existing designs adopt either fully autoregressive decoding or parallel bidirectional decoding. DeepThinkVLA introduces a hybrid architecture, enabling autoregressive CoT reasoning alongside efficient parallel action generation.
  • Figure 2: Pipeline for constructing an embodied CoT dataset. Stage 1 extracts keyframes via gripper state signals and queries a cloud LVLM to generate CoT for those frames. Stage 2 fine-tunes a local vision–language model on the keyframe CoT and uses it to annotate the remaining frames.
  • Figure 3: Reinforcement learning stage with grouped credit assignment. The model generates CoT and action sequences that are executed in the simulator to produce trajectories with verifiable rewards. Rewards are grouped and standardized to compute token-level advantages, which update the policy via a clipped surrogate objective with KL regularization to the SFT reference.
  • Figure 4: Effect of RL on long-horizon task performance (LIBERO-Long). Bars show base SR for each model, while lighter shaded segments indicate gains over the baseline. For DeepThinkVLA, the additional teal segment highlights the extra improvement from RL over SFT (+2 pp). The figure illustrates that all DeepThinkVLA variants outperform the baseline, and RL further aligns CoT reasoning with action generation to boost success rate.
  • Figure 5: "Think before acting" enables error recovery. Comparison of rollouts on a LIBERO task. Left: the baseline misses the grasp and falls into a repetitive failure loop. Right: DeepThinkVLA leverages a reasoning trace to restate the subgoal, correct mistakes, and complete the task.
  • ...and 1 more figures
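
The update described in the Figure 3 caption (group-standardized rewards, token-level advantages, a clipped surrogate objective, and KL regularization to the SFT reference) can be sketched as follows. This is a hedged illustration rather than the paper's code: the function names, the `clip` and `beta` values, the array shapes, and the particular non-negative KL estimator are assumptions.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate_loss(logp_new, logp_old, logp_ref, advantages,
                           clip=0.2, beta=0.01):
    """Loss to minimize; log-prob arrays have shape (group_size, num_tokens).

    clip and beta are illustrative values, not taken from the paper.
    """
    ratio = np.exp(logp_new - logp_old)
    # Broadcast each rollout's advantage to all of its tokens.
    adv = np.asarray(advantages, dtype=float)[:, None]
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv)
    # Per-token, non-negative KL estimate against the frozen SFT reference
    # (an assumed estimator; the paper may use a different one).
    log_ratio_ref = logp_ref - logp_new
    kl = np.exp(log_ratio_ref) - log_ratio_ref - 1.0
    return float(-(surrogate - beta * kl).mean())

# Four rollouts of one task with binary task-success rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
logp = np.zeros((4, 5))
loss = clipped_surrogate_loss(logp, logp, logp, adv)
```

Because the advantage is computed per rollout and then shared by every token in it, CoT tokens and action tokens of a successful rollout are reinforced together, which is one way to read the caption's claim that RL aligns reasoning with action.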