Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search

Cyrus Neary, Omar G. Younis, Artur Kuramshin, Ozgur Aslan, Glen Berseth

TL;DR

VLAPS addresses the brittleness of pre-trained Vision-Language-Action (VLA) policies by integrating model-based search at inference time. It uses a world model to simulate future outcomes and VLA-derived priors to guide exploration over temporally-abstract action chunks, which makes planning tractable in large action spaces. Across Libero language-conditioned tasks, VLAPS consistently improves over VLA-only baselines, raising success rates by as much as 67 percentage points and allowing small VLAs to match larger state-of-the-art models. VLAPS requires no additional training and is agnostic to the specific VLA used, offering a practical path to robust, language-conditioned robotic planning with controllable compute.

Abstract

Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviors or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning & Search (VLAPS) -- a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm -- run using a model of the target environment -- using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA policy's inference procedure, VLAPS yields behaviors that are more performant than those obtained by directly following the VLA policy's action predictions. VLAPS offers a principled framework to: i) control test-time compute in VLA models, ii) leverage a priori knowledge of the robotic environment, and iii) integrate established planning and reinforcement learning techniques into the VLA inference process. Across all experiments, VLAPS significantly outperforms VLA-only baselines on language-specified tasks that would otherwise be intractable for uninformed search algorithms, increasing success rates by as much as 67 percentage points.
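
The abstract says the search is a modified MCTS biased by VLA-derived action priors, but does not give the selection rule. One common way to bias MCTS with a policy prior is the PUCT rule used in AlphaZero-style planners. The sketch below assumes that rule purely for illustration; it is not necessarily the modification the authors use, and the names `puct_score`, `select_child`, and `c_puct` are hypothetical.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.0):
    """Value estimate plus an exploration bonus scaled by the prior.

    Assumed PUCT form (not taken from the paper):
        Q + c_puct * prior * sqrt(parent_visits + 1) / (1 + child_visits)
    """
    bonus = c_puct * prior * math.sqrt(parent_visits + 1) / (1 + child_visits)
    return q_value + bonus


def select_child(children, c_puct=1.0):
    """Return the index of the child with the highest PUCT score.

    `children` is a list of dicts with keys 'q' (mean value),
    'prior' (probability from the policy), and 'visits' (visit count).
    """
    parent_visits = sum(child["visits"] for child in children)
    scores = [
        puct_score(child["q"], child["prior"], parent_visits, child["visits"], c_puct)
        for child in children
    ]
    return max(range(len(children)), key=scores.__getitem__)
```

With equal value estimates, an unvisited tree favours the child with the larger prior; as that child accumulates visits, its bonus shrinks and lower-prior children get explored. This is the sense in which a prior can focus search without ruling out alternatives.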

Paper Structure

This paper contains 30 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The proposed VLAPS framework. At each decision-making point in the task, VLAPS uses a world model to conduct a search over temporally-abstract action chunks. The method uses a pre-trained VLA policy to guide the expansion (§ \ref{sec:defining_search_space}) and selection (§ \ref{sec:biasing_selection}) procedures of the search. Once the search finds a node that completes the task, or the compute budget is exhausted, VLAPS returns the most promising action sequence to be executed in the environment, before repeating the process with the next observation.
  • Figure 2: VLA-informed node expansion and action selection. Whenever a new node is encountered, VLAPS uses the VLA policy to sample a discrete set of candidate action chunks to search over (§ \ref{sec:defining_search_space}). Throughout the remaining search, VLAPS biases action chunk selection towards chunks similar to those output by the VLA (§ \ref{sec:biasing_selection}).
  • Figure 3: Illustrative examples where VLAPS succeeds while the underlying VLA policy fails. The VLA policy is Octo finetuned for $50$k steps on Libero demonstration data. In the top two tasks, the VLA policy fails to grasp the target object, which prevents successful task completion, whereas VLAPS successfully completes the pick-and-place. In the bottom task, the VLA policy prematurely closes the drawer before retrieving the bowl, while VLAPS correctly places the bowl inside before closing it.
  • Figure 4: Performance of VLAPS and the VLA-only policy in Libero, as a function of the number of finetuning steps of the underlying VLA model. Top: Task success rate. Bottom: Mean algorithm runtime to complete each task. Runtimes are reported only for successful task evaluations. Failed VLAPS evaluations consistently reach the $600$s search timeout. For each VLA checkpoint, both VLAPS and the VLA-only policy are evaluated on $1000$ total tasks, drawn from five Libero task suites: Libero-Spatial, Libero-Goal, Libero-Object, Libero-10, and Libero-90. Each suite contributes 10 distinct tasks, each of which we test from 10 different initial conditions.
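
The expansion and selection steps described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_vla_sample` is a hypothetical stand-in for a real VLA, and the similarity measure (a softmax over negative Euclidean distance to a reference chunk, with a `temperature` parameter) is an assumed choice, since the captions only say that selection is biased towards chunks similar to the VLA's output.

```python
import math
import random


def sample_candidate_chunks(vla_sample, observation, num_candidates, rng):
    """Expansion: query the policy several times to obtain a discrete set
    of candidate action chunks for a newly encountered node."""
    return [vla_sample(observation, rng) for _ in range(num_candidates)]


def chunk_distance(chunk_a, chunk_b):
    """Euclidean distance between two chunks (lists of action vectors)."""
    return math.sqrt(sum(
        (x - y) ** 2
        for step_a, step_b in zip(chunk_a, chunk_b)
        for x, y in zip(step_a, step_b)
    ))


def similarity_priors(candidates, reference_chunk, temperature=1.0):
    """Selection bias (assumed form): softmax over negative distance, so
    candidates closer to the reference chunk receive a larger prior."""
    logits = [-chunk_distance(c, reference_chunk) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def toy_vla_sample(observation, rng, horizon=4):
    """Hypothetical stand-in for a VLA: a noisy chunk of `horizon` actions
    centred on the observation vector."""
    return [[o + rng.gauss(0.0, 0.1) for o in observation] for _ in range(horizon)]


rng = random.Random(0)
observation = [0.5, -0.2]
candidates = sample_candidate_chunks(toy_vla_sample, observation, 5, rng)
reference = toy_vla_sample(observation, rng)
priors = similarity_priors(candidates, reference)
```

In a full planner, priors of this kind would weight the exploration term during tree selection, and the outer loop would execute the best chunk found, observe the new state, and search again.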