VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
Yiye Chen, Yanan Jian, Xiaoyi Dong, Shuxin Cao, Jing Wu, Patricio Vela, Benjamin E. Lundell, Dongdong Chen
TL;DR
VISTA tackles vision-action misalignment in Vision-Language-Action (VLA) models by explicitly strengthening visual conditioning. It first applies Direct Preference Optimization (DPO) on a track-following surrogate task to align action predictions with visual tracks, then uses latent-space distillation during supervised fine-tuning to transfer the improved grounding to the instruction-following policy. Empirical results on the LIBERO and CALVIN benchmarks show consistent gains in visual conditioning and task performance for both the discrete OpenVLA and continuous OpenVLA-OFT settings, including a notable improvement on CALVIN ABC→D. The work indicates that stronger visual grounding leads to more reliable action outputs, and it achieves this without architectural modifications or additional data collection.
Abstract
Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions depend only weakly on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised fine-tuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/
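
To make the two training stages more concrete, the following is a minimal sketch of the kinds of losses they might involve. It is not the authors' implementation. The first function is the standard DPO objective for a single preference pair; here the "chosen" action is assumed to be the one that follows the visual track and the "rejected" action the one that does not. The second function assumes the latent distillation term is a mean-squared error between student and teacher latent vectors, which the abstract does not specify. All function names, argument names, and the default beta are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are log-probabilities of the chosen/rejected actions under the
    policy being trained; ref_logp_* are the same quantities under a frozen
    reference policy. beta=0.1 is an illustrative default, not a value
    taken from the paper.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def latent_distill_loss(student_latent, teacher_latent):
    """Assumed distillation term: mean-squared error between two latents.

    The paper's actual distance and choice of latent layer may differ.
    """
    if len(student_latent) != len(teacher_latent):
        raise ValueError("latent vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_latent, teacher_latent)]
    return sum(squared) / len(squared)


# Example: the policy already prefers the track-following action more than
# the reference does, so the DPO loss falls below log(2).
pair_loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
distill = latent_distill_loss([1.0, 2.0], [1.0, 4.0])
```

In this sketch, a policy identical to the reference gives a DPO loss of log 2, and the loss decreases as the policy assigns relatively more probability to the track-following action. During supervised fine-tuning, a distillation term like the one above would presumably be added to the usual imitation loss with some weighting, which is left out here.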
