VLS: Steering Pretrained Robot Policies via Vision-Language Models

Shuo Liu; Ishneet Sukhvinder Singh; Yiqing Xu; Jiafei Duan; Ranjay Krishna

TL;DR

This work tackles the brittleness of pretrained diffusion and flow-matching robot policies under test-time shifts in observations and instructions. It introduces Vision-Language Steering (VLS), a training-free framework that steers the sampling of a frozen policy at inference time by grounding out-of-distribution (OOD) inputs into spatial constraints and using vision-language models to generate stage-aware differentiable rewards. By combining gradient guidance, repulsive diversity initialization, and Feynman–Kac resampling, VLS adapts without fine-tuning, outperforming baselines on CALVIN and LIBERO-PRO and enabling real-world deployment on a Franka robot. The results indicate that inference-time control can reuse existing skills across spatial and semantic variations, reducing the need for retraining.
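As a rough illustration of the resampling idea mentioned above, the following minimal sketch reweights a batch of action proposals by reward and resamples them. It is not the paper's implementation: the exponential potential `exp(lam * reward)`, the function name `resample_by_reward`, and the parameter `lam` are all assumptions made for illustration, and a full Feynman–Kac scheme would apply such a step at intermediate denoising steps rather than once.

```python
import numpy as np

def resample_by_reward(proposals, rewards, lam=1.0, rng=None):
    """Resample K proposals with probability proportional to exp(lam * reward).

    Hypothetical helper: the exponential potential and `lam` are assumptions,
    not details taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    proposals = np.asarray(proposals, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp(lam * (rewards - rewards.max()))
    weights /= weights.sum()
    # Draw K indices with replacement; high-reward proposals get duplicated.
    idx = rng.choice(len(proposals), size=len(proposals), p=weights)
    return proposals[idx], weights
```

With a small `lam` the batch stays diverse; with a large `lam` it collapses onto the highest-reward proposal, which is one way to see why a separate diversity mechanism can be useful.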

Abstract

Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/
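To make the notion of reward-guided denoising concrete, here is a toy sketch under strong simplifying assumptions. The frozen policy is replaced by a stand-in update that pulls samples toward a fixed `base_mean`, the reward is a hand-written quadratic `R(a) = -||a - target||^2` with an analytic gradient, and the names `guided_denoise`, `reward_grad`, `step_size`, and `guidance` are hypothetical. In the setting described in the abstract, the reward would instead be synthesized by a vision-language model and differentiated with an autodiff framework.

```python
import numpy as np

def reward_grad(actions, target):
    """Gradient of the toy reward R(a) = -||a - target||^2 with respect to a."""
    return -2.0 * (actions - target)

def guided_denoise(x, base_mean, target, steps=50, step_size=0.1, guidance=0.0):
    """Iteratively refine action samples, optionally nudged by a reward gradient.

    The first update is a toy stand-in for a frozen policy's denoising step;
    the second adds the reward gradient scaled by `guidance` (an assumption).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(steps):
        # Toy "policy" update: move toward what the base policy prefers.
        x = x + step_size * (base_mean - x)
        # Steering update: ascend the reward without touching the policy.
        x = x + step_size * guidance * reward_grad(x, target)
    return x
```

With `guidance=0.0` the samples settle where the base policy prefers; with a positive `guidance` they settle between the policy's preference and the reward's optimum, which is the trade-off that a guidance strength controls.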

Paper Structure (24 sections, 11 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Related Work
Imitation-Trained Policies under Small Environment Shifts
VLM-based Scene Understanding with Re-optimization
Inference-time Steering of Generative Policies
Problem Formulation
The OOD Dilemma in Imitation Learning
Diffusion and Flow Matching Policies
Problem Formulation
Our Approach: VLS
OOD Input Grounding and Reward Generation
OOD Input Grounding
Programmatic Reward Generation
Action Denoising Process Guidance
Diverse Proposal Initialization with Repulsive Forces
...and 9 more sections

Figures (5)

Figure 1: We present Vision-Language Steering (VLS), a training-free framework for inference-time steering of frozen generative robot policies. Our core idea is to leverage the open-world understanding capabilities of VLMs to generate reward functions for partially denoised action proposals, helping the base policy operate successfully in out-of-distribution (OOD) scenarios such as object changes, scene changes, or instruction changes by correcting the denoising path. VLS performs strongly in simulation benchmarks as well as in real-world experiments.
Figure 2: VLS pipeline overview. At environment time step $t$, given an RGB-D observation $o_t$ and a language instruction $l$, VLS first uses the Segment Anything Model (SAM) [sam] and DINOv2 [dino] features to ground the conditioning inputs into a set of spatial keypoints $\mathcal{P}$. A vision-language model is then queried to generate a series of stage-aware, differentiable programmatic reward functions $\{\mathcal{R}_s\}_{s=1}^S$ from the observation, task instruction, and keypoints. These rewards guide the action generation process of the frozen base policy $\pi^\star$: during the denoising sampling loop, the system corrects action trajectories by injecting reward gradients, and combines RBF repulsion terms [tdp] with a Feynman–Kac-based resampling mechanism [singhal2025general] to converge quickly to high-reward regions while maintaining sample diversity. Finally, VLS closes the loop with a stage-switching system driven by reward feedback: it uses adaptive guidance strength and Schmitt-trigger [schmitt1938thermionic] switching logic to monitor execution progress, automatically triggering stage transitions or retries under physical uncertainty (such as object displacement or manipulation failures), so that long-horizon manipulation tasks can be completed robustly in OOD environments.
Figure 3: Steering methods comparison on CALVIN. Success rates for VLS (ours), DynaGuide, ITPS, and the base diffusion policy across movable objects (cubes) and articulated parts (drawer, switch, button, door). VLS achieves 94% average on movable objects (7.4$\times$ over the base policy) and 87% on articulated parts (9.6$\times$ over the base policy), outperforming prior steering methods by 15–25 percentage points. Error bars show standard deviation over 600 episodes per task.
Figure 4: (left) Ablation of VLS components (50 episodes per task). We compare Full VLS (gradient guidance + FK steering + RBF diversity, with $K=10$) against variants that remove FK steering (w/o FKD), remove RBF diversity (w/o RBF), or remove gradient guidance (w/o grad). (right) Scaling with sample batch size $K$ on door_left (50 episodes). Larger $K$ improves performance but increases inference time, illustrating a compute-performance tradeoff.
Figure 5: Real-world Deployment on a Franka robot. (Left: In-distribution tasks) Task layouts, language instructions, and success rates for in-distribution real-world manipulation. Level 1 (top) requires placing an orange onto a specified plate (red or green) based on the instruction. Level 2 (bottom) introduces an additional object (banana), requiring sequential selection of both the target object and the target plate. Bar plots report per-task and average success rates for the frozen $\pi$-0.5 baseline and VLS. (Right: Out-of-distribution tasks) Task layouts, instructions, and results under test-time distribution shifts. We evaluate three OOD variants: (1) Appearance shift (top), replacing the red/green plate with a previously unseen yellow plate; (2) Position shift (middle), swapping the locations of the two plates while keeping the instruction unchanged; (3) Object shift (bottom), replacing the banana with a never-before-seen mug and instructing the robot to place the mug on the green plate. Each task is evaluated over 20 trials. Grasping the correct object contributes 50% success, and full task completion contributes 100%. VLS consistently outperforms the baseline and maintains robust execution under real-world OOD conditions.
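The Schmitt-trigger switching logic mentioned in the Figure 2 caption can be sketched as a small hysteresis state machine. This is only an illustration under assumptions: the class name `StageSwitcher`, the thresholds `high` and `low`, the normalization of rewards to $[0, 1]$, and the specific rule "fall back one stage when the previous stage's reward drops below `low`" are hypothetical choices, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class StageSwitcher:
    """Hysteresis-based stage tracker (illustrative; thresholds are assumptions)."""
    num_stages: int
    high: float = 0.8   # advance when the current stage's reward reaches this
    low: float = 0.3    # retry when the previous stage's reward falls below this
    stage: int = 0

    def update(self, rewards):
        """Update the active stage from per-stage rewards in [0, 1]."""
        if self.stage > 0 and rewards[self.stage - 1] < self.low:
            # A completed stage has been undone (e.g. the object slipped): retry it.
            self.stage -= 1
        elif rewards[self.stage] >= self.high and self.stage < self.num_stages - 1:
            # Current stage is clearly satisfied: move on to the next one.
            self.stage += 1
        # Rewards between `low` and `high` leave the stage unchanged (hysteresis).
        return self.stage
```

Because advancing and retrying use different thresholds, a reward that hovers around a single cutoff does not cause the controller to flip back and forth between stages.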
