VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang

Abstract

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project page and code: https://cywang735.github.io/VLA-Thinker/

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between text-based CoT reasoning (left) and thinking-with-image reasoning (right) for VLA. Left: Conventional VLA reasoning models adopt a text-based Chain-of-Thought paradigm that treats visual inputs as static context; in the example shown, the model fails to grasp the target object. Right: Our proposed thinking-with-image framework models perception as a dynamically invocable reasoning action, enabling the model to call visual tools during intermediate reasoning steps and realize an interleaved perception–reasoning–action process, ultimately completing the manipulation task successfully.
  • Figure 2: The upper panel illustrates the main process of our proposed Thinking-with-Image framework. Language instructions and visual observations are encoded into a shared VLM, enabling interleaved reasoning and dynamic zoom-in perception before action generation. The lower panel presents the two-stage training strategy: (1) SFT cold-start to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align multimodal reasoning–action trajectories with task-level objectives under sparse rewards.
  • Figure 3: RL training curves. (a) Task success reward steadily increases during GRPO training, demonstrating effective trajectory-level alignment under sparse rewards. (b) The average response length gradually decreases, indicating that the policy learns to invoke visual tools more selectively and reduce redundant reasoning.
  • Figure 4: Prompt template for training and inference.
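
The captions for Figures 2 and 3 refer to GRPO training under sparse, task-level rewards. As a rough illustration of why a binary success signal can still drive learning, the sketch below computes group-relative advantages in the style commonly associated with GRPO: each rollout's reward is normalized against the other rollouts sampled for the same task. This is a minimal sketch, not the paper's implementation. The function name, the group size, the epsilon value, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar per rollout of the same task. `eps` is an
    assumed stabilizer for groups in which every rollout got the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one task:
# 1.0 marks task success, 0.0 marks failure (a sparse, binary reward).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every rollout fails carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch, successful rollouts receive positive advantages and failed ones negative advantages, while a group whose rollouts all succeed or all fail yields zero advantage for every member.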