Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion

Zhuo Li, Junjia Liu, Zhipeng Dong, Tao Teng, Quentin Rouxel, Darwin Caldwell, Fei Chen

TL;DR

Problem addressed: pre-trained Vision-Language-Action (VLA) policies lose performance when deployed on downstream tasks. Main approach: VLA-Pilot provides plug-and-play inference-time policy steering, using Embodied Policy Steering Chain-of-Thought (EPS-CoT) with open-world multimodal LLM verifiers and an Evolutionary Diffusion-based action optimizer, which enables zero-shot deployment without fine-tuning. Key contributions and findings: a training-free steering framework that infers a steering objective $R(a_t;c_t)$ and evolves action proposals against it, plus iterative refinement for closed-loop correction. The method is validated on six tasks and two embodiments, with substantial MSR gains that approach the performance of fine-tuning on demonstrations. Significance: demonstrates data-efficient, scalable deployment of generalist VLA policies in varied robotic settings, with robust cross-embodiment generalization and strong zero-shot performance.

Abstract

Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments. Experimental videos and code are available at: https://rip4kobe.github.io/vla-pilot/.

Paper Structure

This paper contains 14 sections, 8 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of VLA policy steering. Prior methods enhance pre-trained VLA policies for downstream tasks through training-time policy fine-tuning. In contrast, we propose VLA-Pilot, an inference-time policy steering method that enables zero-shot deployment of pre-trained VLA policies without any additional fine-tuning or data collection.
  • Figure 2: Overview of VLA-Pilot. Given a task context, VLA-Pilot steers a pre-trained VLA policy at inference time via three key steps: 1) Steering Objective Reasoning employs the EPS-CoT module to infer a task-aligned steering objective reward from the given task context; 2) Action Proposal Optimization uses Evolutionary Diffusion to score and optimize action proposals from the pre-trained VLA policy against the inferred objective reward, then executes the highest-scoring proposal; 3) Iterative Steering Refinement integrates post-execution reflection into the EPS-CoT reasoning loop, enabling closed-loop refinement for improved steering accuracy and robustness.
  • Figure 3: Embodied Policy Steering Chain-of-Thought. EPS-CoT guides the steering objective reasoning process through a structured CoT.
  • Figure 4: Truncated Diffusion-Denoising Process. VLA-Pilot employs a truncated diffusion-denoising mechanism to mutate elite proposals, thereby enhancing action diversity and exploration capabilities to achieve better task alignment.
  • Figure 5: Qualitative results of real robot experiments. VLA-Pilot effectively steers off-the-shelf pre-trained VLA policies to complete downstream tasks at inference time, achieving zero-shot deployment across both ID and OOD task scenarios.
  • ...and 3 more figures
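
The action proposal optimization step described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the reward function, the target action, the noise scale, the elite fraction, and every function name below are hypothetical stand-ins chosen for illustration. In particular, `reward` replaces the reasoned steering objective $R(a_t;c_t)$ with a simple distance to a made-up target, and `identity_denoiser` is a placeholder where a real system would presumably run the pre-trained policy's denoiser. The sketch only shows the general shape of the loop: score proposals, keep the elites, mutate the elites with a few noise steps followed by a denoising pass, and return the highest-scoring action.

```python
import math
import random


def reward(action, target):
    # Hypothetical stand-in for R(a_t; c_t): higher is better. Here it is
    # the negative Euclidean distance to a made-up target action.
    return -math.dist(action, target)


def identity_denoiser(action, steps):
    # Placeholder for the denoising pass (an assumption, not the paper's
    # model). It returns the noisy action unchanged.
    return action


def mutate(action, rng, steps=3, sigma=0.05, denoise=identity_denoiser):
    # Toy truncated diffusion-denoising: add Gaussian noise for only a few
    # steps instead of a full diffusion schedule, then call the denoiser.
    noisy = tuple(action)
    for _ in range(steps):
        noisy = tuple(x + rng.gauss(0.0, sigma) for x in noisy)
    return tuple(denoise(noisy, steps))


def steer(proposals, target, rng, generations=5, elite_frac=0.25):
    # Evolutionary search over action proposals with elitism: the best
    # proposals survive each generation unchanged, the rest are replaced
    # by mutated copies of randomly chosen elites.
    population = [tuple(a) for a in proposals]
    for _ in range(generations):
        population.sort(key=lambda a: reward(a, target), reverse=True)
        n_elite = max(1, int(len(population) * elite_frac))
        elites = population[:n_elite]
        children = [mutate(rng.choice(elites), rng)
                    for _ in range(len(population) - n_elite)]
        population = elites + children
    # Return the highest-scoring action, as in "executes the
    # highest-scoring proposal".
    return max(population, key=lambda a: reward(a, target))


# Usage: random 3-D vectors stand in for proposals sampled from a policy.
rng = random.Random(0)
target = (0.5, -0.2, 0.1)
proposals = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(16)]
best = steer(proposals, target, rng)
print(best, reward(best, target))
```

Because the elites are carried over unchanged, the returned action in this sketch never scores lower than the best initial proposal. Whether VLA-Pilot itself retains elites this way is not stated in the text above, so treat that as a design choice of the sketch.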