VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Chaoyang Wang; Wenrui Bao; Sicheng Gao; Bingxin Xu; Yu Tian; Yogesh S. Rawat; Yunhao Ge; Yuzhang Shang

Abstract

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project page and code: https://cywang735.github.io/VLA-Thinker/
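
To make the second training stage more concrete, the sketch below shows the group-relative advantage computation commonly associated with GRPO, applied to sparse binary task rewards. It is an illustration under stated assumptions, not the paper's implementation: the function name, the group size, the `eps` constant, and the choice of population standard deviation are all assumptions, and the actual objective used in VLA-Thinker may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar per rollout sampled for the same task.
    With a sparse reward, each entry is 1.0 (success) or 0.0 (failure).
    """
    mu = mean(rewards)
    # Population std is an assumption; implementations differ on this choice.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails carries no learning signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages, so the whole reasoning-action trajectory is credited by task outcome alone. When every rollout in a group has the same outcome, all advantages are zero, which is one reason sparse-reward training depends on groups with mixed results.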

Paper Structure (16 sections, 7 equations, 4 figures, 4 tables)

Introduction
Method
Problem Formulation
Training Strategies
Discussion
Experiment
Experimental Setup
Main Results
Ablation Study
Training Curves
Related Work
Conclusion
Prompt Template
Additional Implementation Details
Inference Speed
...and 1 more sections

Figures (4)

Figure 1: Comparison between text-based CoT Reasoning (left) and Thinking-with-Image Reasoning (right) for VLA. Left: Conventional VLA reasoning models adopt a text-based Chain-of-Thought reasoning paradigm, treating visual inputs as static context, which fails to successfully grasp the target object. Right: Our proposed thinking-with-image framework models perception as a dynamically invocable reasoning action, enabling the model to call visual tools during intermediate reasoning steps and realize an interleaved perception–reasoning–action process, ultimately completing the manipulation task successfully.
Figure 2: The upper panel illustrates the main process of our proposed Thinking-with-Image framework. Language instructions and visual observations are encoded into a shared VLM, enabling interleaved reasoning and dynamic zoom-in perception before action generation. The lower panel presents the two-stage training strategy: (1) SFT cold-start to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align multimodal reasoning–action trajectories with task-level objectives under sparse rewards.
Figure 3: RL Training curves. (a) Task success reward steadily increases during GRPO training, demonstrating effective trajectory-level alignment under sparse rewards. (b) The average response length gradually decreases, indicating that the policy learns to invoke visual tools more selectively and reduce redundant reasoning.
Figure 4: Prompt template for training and inference.
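
The interleaved perception-reasoning-action process described in the captions for Figures 1 and 2 can be pictured as a loop in which the policy either requests a zoomed-in view or emits an action. The following is a minimal sketch of that control flow only. Every name in it (`zoom_in`, `think_then_act`, `stub_policy`, the step dictionary layout, the tool-call budget, and the 7-dimensional action) is hypothetical and chosen for illustration; none of it is taken from the paper or its code.

```python
def zoom_in(image, box):
    """Crop a region given as (top, left, bottom, right) pixel indices."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def think_then_act(policy, image, instruction, max_tool_calls=3):
    """Alternate between reasoning and zoom-in calls until an action is emitted."""
    context = [("instruction", instruction), ("image", image)]
    for i in range(max_tool_calls + 1):
        # On the final iteration the policy is told it must act.
        step = policy(context, allow_tools=i < max_tool_calls)
        if step["type"] == "action":
            return step["value"], context
        # Tool call: append the thought and the cropped view to the context.
        context.append(("thought", step["thought"]))
        context.append(("image", zoom_in(image, step["box"])))
    raise RuntimeError("policy requested a tool after the budget was exhausted")


def stub_policy(context, allow_tools):
    """Stand-in for the VLM: zoom once, then act (hypothetical behavior)."""
    n_images = sum(1 for kind, _ in context if kind == "image")
    if allow_tools and n_images < 2:
        return {"type": "zoom", "thought": "The target is small; look closer.",
                "box": (1, 1, 3, 3)}
    return {"type": "action", "value": [0.0] * 7}  # assumed 7-dim action


# Toy 4x4 "image" whose pixel values encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
action, context = think_then_act(stub_policy, image, "pick up the cup")
```

In this toy run the stub requests one crop of the image center and then acts, so the final context holds the instruction, the full image, one thought, and one cropped view. A shorter context of this kind corresponds to the more selective tool use that the Figure 3 caption attributes to the trained policy.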
