VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search

Wenkai Guo, Guanxing Lu, Haoyuan Deng, Zhenyu Wu, Yansong Tang, Ziwei Wang

TL;DR

A plug-in framework that lets off-the-shelf VLAs foresee future states through test-time scaling. It uses Monte Carlo Tree Search, seeded at the root by step-wise VLA predictions, to search large action spaces efficiently.

Abstract

Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and therefore struggle with long-horizon trajectory tasks, where incremental deviations accumulate. To address this problem, we propose VLA-Reasoner, a plug-in framework that empowers off-the-shelf VLAs to foresee future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out candidate action trajectories, in which the sampled actions serve as rationales for generating future states through a world model; this enables VLA-Reasoner to foresee and reason about potential outcomes and to search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where step-wise VLA predictions seed the root. In addition, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline value estimation strategy, which scores predicted futures and corrects deviations with long-term feedback. We conduct extensive experiments in both simulators and the real world, demonstrating that VLA-Reasoner achieves significant improvements over state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation for robotic manipulation. The project website is available at: https://vla-reasoner.github.io/.
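The KDE-based confidence sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the density is fit or used, so everything below is an assumption: an isotropic Gaussian kernel, a fixed bandwidth, a hypothetical pool of prior action samples (`prior_actions`), and density used as a confidence proxy for ranking candidates. The names `kde_density`, `kde_sample`, and `propose_actions` are illustrative and do not come from the paper.

```python
import math
import random


def kde_density(x, samples, bandwidth):
    """Isotropic Gaussian KDE density at point x (a list of floats)."""
    d = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(samples) * norm)


def kde_sample(samples, bandwidth, rng):
    """Draw one point from the KDE: pick a stored sample, add Gaussian noise."""
    center = rng.choice(samples)
    return [c + rng.gauss(0.0, bandwidth) for c in center]


def propose_actions(prior_actions, bandwidth, n_candidates, top_k, seed=0):
    """Sample candidates from the KDE and keep the top_k highest-density ones.

    Assumption: density under the KDE stands in for "confidence". No policy
    model is queried here, only the stored prior_actions.
    """
    rng = random.Random(seed)
    candidates = [
        kde_sample(prior_actions, bandwidth, rng) for _ in range(n_candidates)
    ]
    candidates.sort(
        key=lambda a: kde_density(a, prior_actions, bandwidth), reverse=True
    )
    return candidates[:top_k]
```

Calling `propose_actions([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]], bandwidth=0.05, n_candidates=20, top_k=3)` would return three 2-D candidate actions clustered near the stored samples. In a tree search, such candidates could serve as children of a node during expansion, which is one way to explore without calling the VLA again for every branch.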

Paper Structure

This paper contains 24 sections, 7 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: VLA-Reasoner augments VLA models with test-time reasoning via online tree search, enabling more robust and interpretable robotic manipulation than baselines.
  • Figure 2: The overall pipeline of VLA-Reasoner. At test time, a lightweight, modified MCTS searches for the optimal action conditioned on the VLA prediction. The search is steered by expert-like sampling and dense value estimation, which guide expansion and backup throughout the tree. The method is plug-and-play: it can be attached to any VLA-based manipulation policy and consistently improves performance across tasks, environments, and robot embodiments.
  • Figure 3: Setup of real world experiments. We conduct diverse tasks in the real world to identify the limitations of current VLAs and validate our method.
  • Figure 4: Case Visualization. The baseline policy ($\pi_{0}$-FAST, top row) suffers from excessive action drift and fails because of these deviations. With reasoning, VLA-Reasoner (bottom row) proactively corrects misalignment via value-guided search, enabling success.
  • Figure 5: Analysis of injection strength $\alpha$. $\alpha$ controls the trade-off between the VLA action and the reasoner action; a larger $\alpha$ assigns greater weight to the VLA action. $\alpha=1.0$ corresponds to the vanilla VLA.
  • ...and 2 more figures
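
The role of the injection strength $\alpha$ in the Figure 5 caption can be sketched as a convex combination of the two actions. The exact blending rule is not given in this excerpt, so the linear form below is an assumption chosen only to match the caption: a larger $\alpha$ weights the VLA action more, and $\alpha=1.0$ recovers the vanilla VLA. The function name `inject_action` is hypothetical.

```python
def inject_action(vla_action, reasoner_action, alpha):
    """Blend two action vectors: alpha * VLA + (1 - alpha) * reasoner.

    Assumed linear rule, not confirmed by the source. alpha = 1.0 returns
    the VLA action unchanged; alpha = 0.0 returns the reasoner action.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(vla_action) != len(reasoner_action):
        raise ValueError("action vectors must have the same length")
    return [
        alpha * v + (1.0 - alpha) * r
        for v, r in zip(vla_action, reasoner_action)
    ]
```

For example, `inject_action([1.0, 2.0], [3.0, 4.0], 0.5)` averages the two actions element-wise, while `alpha=1.0` leaves the VLA prediction untouched.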