Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine

TL;DR

The paper addresses the challenge of grounding pretrained vision-language models (VLMs) in robotic control by introducing Steerable Policies: vision-language-action models (VLAs) that accept steering commands at several levels of abstraction, including tasks, subtasks, motions, and grounded pixel coordinates. It presents two hierarchical control methods that build on them: (i) a learned embodied reasoner that produces reasoning traces and steering commands, and (ii) an off-the-shelf VLM that selects steering abstractions adaptively via in-context reasoning. In real-world experiments on the Bridge WidowX setup, the authors report better generalization and long-horizon task performance than prior VLAs and hierarchical baselines, which highlights the value of policy steerability for transferring VLM reasoning, semantic knowledge, and in-context learning to robotics. The work suggests that rich synthetic steering commands and flexible abstraction levels are key to unlocking VLM capabilities in embodied agents, with potential extensions to learning affordances via reinforcement learning and to cross-task in-context adaptation. Overall, Steerable Policies offer a scalable, architecture-agnostic way to combine foundation models with flexible, multi-level control for robust robotic manipulation.

Abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io

Paper Structure (36 sections, 19 figures, 3 tables)

Figures (19)

  • Figure 1: The hierarchical policy inference loop, where a high-level model sends commands to the low-level Steerable Policy.
  • Figure 2: Our automated pipeline for annotating robot data with synthetic steering commands at scale. (1) We use a suite of foundation models to extract subtasks and grounded features (bounding boxes, motions, and gripper traces) from each trajectory. (2) We query a VLM to generate diverse steering commands for training Steerable Policies. These commands may reference features extracted in the first step, which we provide in the prompt. (3) To train high-level embodied reasoners, we also generate rationalizations for why particular commands are appropriate for given observations (see the section on learned embodied reasoners).
  • Figure 3: Our two novel high-level policies (see the section on high-level methods). Method (a) fine-tunes a VLM into an embodied reasoner that issues steering commands, while method (b) queries an off-the-shelf VLM to determine appropriate commands via in-context reasoning.
  • Figure 4: Interactive interface for querying humans for oracle steering commands. (1) The operator can interrupt the rollout to issue a new steering command. To make it easier to give commands with pixel coordinates, the operator can add textual placeholder markers. (2) If the command contains any markers, a GUI opens showing the current robot observation, and the operator clicks on the image to fill them in. (3) Finally, the rollout resumes with the new command.
  • Figure 6: Our approach of controlling our Steerable Policy with a learned high-level embodied reasoning model outperforms four baselines: the equivalent standard Bridge OpenVLA [Kim24-openVLA], the Reasoning Pretraining and Dropout ECoT-Lite methods [Chen25-ecot-lite], and full Embodied Chain-of-Thought Reasoning [Zawalski24-ecot]. Error bars denote $\pm 1$ standard error. We adopt the same Bridge task suite as ECoT-Lite [Chen25-ecot-lite], enabling a direct comparison with other methods for training VLAs with embodied reasoning data.
  • ...and 14 more figures
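
The hierarchical inference loop described in Figure 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `SteeringCommand`, `run_hierarchical_episode`, and the toy stand-ins are hypothetical, the abstraction-level labels are taken loosely from the abstract, and the fixed `replan_every` interval is an assumption made only to keep the example simple. The sketch shows a high-level model that periodically issues a command at some level of abstraction and a low-level policy that conditions on the latest command at every step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Abstraction levels mentioned in the abstract; these string labels are illustrative.
LEVELS = ("task", "subtask", "motion", "grounded")


@dataclass(frozen=True)
class SteeringCommand:
    """A language command tagged with its (assumed) abstraction level."""
    level: str
    text: str

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown abstraction level: {self.level}")


def run_hierarchical_episode(
    high_level: Callable[[dict], SteeringCommand],
    low_level: Callable[[dict, SteeringCommand], Sequence[float]],
    env_step: Callable[[Sequence[float]], dict],
    first_obs: dict,
    max_steps: int,
    replan_every: int,  # assumption: the high level is queried at a fixed interval
) -> List[SteeringCommand]:
    """Run the loop and return the commands the high-level model issued."""
    obs = first_obs
    issued: List[SteeringCommand] = []
    command = None
    for t in range(max_steps):
        if t % replan_every == 0:
            # The high-level model picks a command and its abstraction level.
            command = high_level(obs)
            issued.append(command)
        # The low-level policy acts on the observation and the latest command.
        action = low_level(obs, command)
        obs = env_step(action)
    return issued


# Toy stand-ins (hypothetical) so the sketch runs without a robot or a model.
def toy_high_level(obs: dict) -> SteeringCommand:
    if obs["t"] < 2:
        return SteeringCommand("subtask", "reach for the object")
    return SteeringCommand("motion", "move the gripper left")


def toy_low_level(obs: dict, command: SteeringCommand) -> Sequence[float]:
    return [0.0, 0.0, 0.0]  # placeholder action; real action spaces differ


def make_toy_env() -> Callable[[Sequence[float]], dict]:
    state = {"t": 0}

    def step(action: Sequence[float]) -> dict:
        state["t"] += 1
        return {"t": state["t"]}

    return step


if __name__ == "__main__":
    commands = run_hierarchical_episode(
        toy_high_level, toy_low_level, make_toy_env(),
        first_obs={"t": 0}, max_steps=4, replan_every=2,
    )
    for c in commands:
        print(c.level, "->", c.text)
```

In this sketch the high-level callable could stand in for either the fine-tuned embodied reasoner or the in-context VLM of Figure 3, since both only need to map an observation to a command; how often the real system replans is not stated in this section.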