VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models

Yiye Chen, Yanan Jian, Xiaoyi Dong, Shuxin Cao, Jing Wu, Patricio Vela, Benjamin E. Lundell, Dongdong Chen

TL;DR

VISTA tackles vision-action misalignment in Vision-Language-Action (VLA) models by explicitly strengthening visual conditioning through track-following preference optimization, then transferring this grounding to instruction-following policies via latent-space distillation. The method first applies Direct Preference Optimization (DPO) on a track-following surrogate task to align action predictions with visual tracks, then uses latent distillation during supervised fine-tuning to carry the improved grounding over to the instruction-following policy. Empirical results on the LIBERO and CALVIN benchmarks show consistent gains in visual conditioning and task performance for both the discrete OpenVLA and continuous OpenVLA-OFT settings, including a notable improvement on CALVIN ABC→D. The work indicates that stronger visual grounding goes hand in hand with more reliable action outputs, and that it can be obtained without architectural modifications or additional data collection.

Abstract

Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .
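
To make the preference-optimization step more concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the `beta` default, and the reading of "chosen" as the action sequence that follows the visual track and "rejected" as one that does not are all assumptions made for illustration. How VISTA actually constructs its track-following preference samples is not described in this section.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are summed log-probabilities of a full action-token sequence
    under the trainable policy and under a frozen reference model.
    Assumption: "chosen" follows the visual track, "rejected" does not.
    """
    # Implicit reward margin between chosen and rejected, scaled by beta.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen sequence's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-3.0, -5.0, -5.0, -5.0)
```

In this sketch the loss falls only when the policy raises the likelihood of the chosen sequence relative to the rejected one, measured against the reference model, which is the mechanism a preference objective would use to reward track-consistent actions.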

Paper Structure (49 sections, 8 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: VISTA Overview. We align VLA action outputs to visual tracks via preference optimization, followed by supervised finetuning with latent distillation. Our experiments show that VISTA enhances visual conditioning and improves performance.
  • Figure 2: Visual Conditioning of the 8-step OpenVLA and VISTA (Ours) in LIBERO-Spatial. The periodic vertical grids indicate that each block of seven tokens decodes to a single action, leading to 56 output tokens for 8 actions.
  • Figure 3: VISTA Methodology. Starting from a vanilla instruction-following SFT model, we apply DPO on track-following preference samples constructed from the instruction-following dataset to align action prediction with visual input. We then transfer the alignment to the instruction-following policy via latent distillation, resulting in enhanced visual conditioning and VLA performance.
  • Figure 4: Illustration of benchmarks.
  • Figure 5: Analysis and Ablation on LIBERO-Spatial. (A) Change of visual conditioning during the VISTA training stages; (B) Ablation on the visual conditioning and VLA performance of alternative training strategies.
  • ...and 3 more figures
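
The token layout described in the Figure 2 caption (seven output tokens per action, hence 56 tokens for an 8-step action chunk) can be illustrated with a small sketch. The helper name `group_action_tokens` and the use of integer placeholders for tokens are hypothetical; only the 7-tokens-per-action and 8-action figures come from the caption.

```python
TOKENS_PER_ACTION = 7  # from the Figure 2 caption: seven tokens decode to one action


def group_action_tokens(tokens, tokens_per_action=TOKENS_PER_ACTION):
    """Split a flat list of output tokens into per-action blocks.

    Hypothetical helper for illustration; raises if the sequence length
    is not a whole number of actions.
    """
    if len(tokens) % tokens_per_action != 0:
        raise ValueError("token count must be a multiple of tokens_per_action")
    return [tokens[i:i + tokens_per_action]
            for i in range(0, len(tokens), tokens_per_action)]


# 56 placeholder tokens correspond to an 8-step action chunk.
chunks = group_action_tokens(list(range(56)))
print(len(chunks))     # 8
print(len(chunks[0]))  # 7
```

Each block boundary in this grouping corresponds to one of the periodic vertical grid lines mentioned in the caption.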