Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine
TL;DR
The paper addresses the challenge of grounding pretrained vision-language models (VLMs) in robotic control by introducing Steerable Policies: vision-language-action models (VLAs) that accept steering commands at multiple levels of abstraction, including tasks, subtasks, motions, and grounded pixel coordinates. It presents two hierarchical control methods that exploit VLM capabilities: (i) a learned embodied reasoner that produces reasoning traces and steering commands, and (ii) an off-the-shelf VLM that uses in-context reasoning to choose steering abstractions adaptively. In real-world Bridge WidowX experiments, the authors report better generalization and long-horizon task performance than prior VLAs and hierarchical baselines, which highlights the value of policy steerability for transferring VLM reasoning, semantic knowledge, and in-context learning to robotics. The work suggests that rich synthetic steering commands and flexible abstraction levels are key to unlocking VLM capabilities in embodied agents, and it points to possible extensions such as learning affordances via reinforcement learning and cross-task in-context adaptation. Overall, Steerable Policies are presented as a scalable, architecture-agnostic way to combine foundation models with flexible, multi-level control for robust robotic manipulation.
Abstract
Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
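The hierarchical interface described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Abstraction`, `SteeringCommand`, `run_episode`, `toy_high_level`, and `toy_low_level`, the string observations, the `replan_every` parameter, and the example pixel coordinates are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow in which a high-level reasoner picks a command at some abstraction level and a low-level policy turns that command into an action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Abstraction(Enum):
    """Command abstraction levels named in the abstract."""
    TASK = "task"
    SUBTASK = "subtask"
    MOTION = "motion"
    GROUNDED = "grounded"  # commands that reference pixel coordinates


@dataclass(frozen=True)
class SteeringCommand:
    level: Abstraction
    text: str


# Hypothetical interfaces: observations are plain strings here, whereas a
# real system would pass camera images and robot state.
HighLevel = Callable[[str, str], SteeringCommand]         # (obs, task) -> command
LowLevel = Callable[[str, SteeringCommand], List[float]]  # (obs, command) -> action


def run_episode(
    task: str,
    observations: Sequence[str],
    high_level: HighLevel,
    low_level: LowLevel,
    replan_every: int = 2,
) -> List[Tuple[SteeringCommand, List[float]]]:
    """Query the high level every `replan_every` steps and the low level every step."""
    trace = []
    command = None
    for step, obs in enumerate(observations):
        if step % replan_every == 0:
            command = high_level(obs, task)  # may switch abstraction level
        trace.append((command, low_level(obs, command)))
    return trace


def toy_high_level(obs: str, task: str) -> SteeringCommand:
    """Toy reasoner: grounded command when far from the object, motion otherwise."""
    if "far" in obs:
        # The coordinates below are made up for illustration only.
        return SteeringCommand(Abstraction.GROUNDED, "move gripper to pixel (120, 84)")
    return SteeringCommand(Abstraction.MOTION, "close the gripper")


def toy_low_level(obs: str, command: SteeringCommand) -> List[float]:
    """Toy policy: a one-dimensional 'action' that depends only on the command level."""
    return [1.0] if command.level is Abstraction.MOTION else [0.0]


trace = run_episode(
    "pick up the carrot", ["far", "near", "near"], toy_high_level, toy_low_level
)
for command, action in trace:
    print(command.level.value, "|", command.text, "|", action)
```

In this toy setup the high level is queried less often than the low level, so a command persists across steps until the next replanning point. Whether and how often the actual system replans is not stated in the abstract.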
