Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
Zhuo Li, Junjia Liu, Zhipeng Dong, Tao Teng, Quentin Rouxel, Darwin Caldwell, Fei Chen
TL;DR
Problem addressed: pre-trained Vision-Language-Action (VLA) policies suffer performance drops when deployed on downstream tasks. Main approach: VLA-Pilot is a plug-and-play inference-time policy steering method that combines Embodied Policy Steering Chain-of-Thought (EPS-CoT) reasoning, which uses open-world Multimodal LLM verifiers, with an Evolutionary Diffusion-based action optimizer, enabling zero-shot deployment without fine-tuning. Key contributions and findings: a training-free steering framework that infers a steering objective $R(a_t;c_t)$ and evolves action proposals toward it, plus iterative refinement for closed-loop correction. The method is validated on six tasks across two embodiments and yields substantial MSR gains, approaching the performance of policies fine-tuned on demonstrations. Significance: the results indicate that generalist VLA policies can be deployed in varied robotic settings in a data-efficient, scalable way, with robust cross-embodiment generalization and strong zero-shot performance.
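The steering loop summarized above (score action proposals against a steering objective, then evolve them) can be illustrated with a simple elitist evolutionary search. This is a minimal sketch under stated assumptions, not the authors' implementation: `steer_actions`, `toy_reward`, and every parameter value are hypothetical. The hand-written reward stands in for the steering objective $R(a_t;c_t)$, which VLA-Pilot infers with an MLLM verifier. The random initial proposals stand in for actions sampled from a pre-trained VLA policy, and plain Gaussian mutation is swapped in for the diffusion-based proposal update.

```python
import random


def toy_reward(action, target=(0.5, -0.2)):
    """Hypothetical stand-in for R(a_t; c_t): negative squared distance to a target."""
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def steer_actions(proposals, reward_fn, n_iters=10, elite_frac=0.25,
                  noise_scale=0.05, seed=0):
    """Evolve action proposals toward higher reward and return the best one.

    Assumption: Gaussian mutation of elites replaces the paper's
    diffusion-based update; only the select-and-evolve structure is shown.
    """
    rng = random.Random(seed)
    # Copy the proposals so the caller's list is left untouched.
    population = [list(a) for a in proposals]
    n_elite = max(1, int(len(population) * elite_frac))

    for _ in range(n_iters):
        # Rank proposals by the steering objective, best first.
        population.sort(key=reward_fn, reverse=True)
        elites = population[:n_elite]

        # Refill the population by perturbing randomly chosen elites.
        children = []
        while len(elites) + len(children) < len(population):
            parent = rng.choice(elites)
            children.append([x + rng.gauss(0.0, noise_scale) for x in parent])

        # Elites are kept unchanged, so the best reward never decreases.
        population = elites + children

    return max(population, key=reward_fn)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical proposals, standing in for samples from a pre-trained policy.
    proposals = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(16)]
    best = steer_actions(proposals, toy_reward)
    print("steered action:", best, "reward:", toy_reward(best))
```

Because the elites are carried over unchanged at every iteration, the returned action is never worse under the reward than the best initial proposal. In a closed-loop setting, such a search would be rerun at each control step with updated proposals and a re-inferred objective.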
Abstract
Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA policies without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments. Experimental videos and code are available at: https://rip4kobe.github.io/vla-pilot/.
