VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
Christos Ziakas, Alessandra Russo
TL;DR
VITA is a zero-shot value function estimator that uses test-time adaptation to encode semantic and temporal context from trajectories. The vision-language model itself stays frozen; at each timestep, a lightweight adapter is updated with a gradient step on a meta-learned self-supervised loss, so the adapter's parameters capture trajectory history without task-specific demonstrations. Dissimilarity-based sampling during training mitigates shortcut learning. The method generalizes across distribution shifts in tasks, environments, and embodiments, distinguishes expert from non-expert trajectories, and provides zero-shot reward shaping for offline RL, outperforming autoregressive VLM baselines. Results are reported on real-world robotic manipulation and on multi-task offline RL in Meta-World.
Abstract
Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.
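The sequential update described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear adapter `W`, its identity initialization, the fixed masked-reconstruction loss (standing in for the meta-learned self-supervised loss), the linear value head `head`, the learning rate, and all function names are assumptions made for illustration. The frozen VLM is represented only by precomputed per-frame feature vectors.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ssl_loss(W: Matrix, z: Vector, mask: Vector) -> float:
    """Hypothetical self-supervised loss: reconstruct z from its masked copy."""
    z_in = [m * zi for m, zi in zip(mask, z)]
    recon = matvec(W, z_in)
    return sum((r - zi) ** 2 for r, zi in zip(recon, z))


def ssl_grad(W: Matrix, z: Vector, mask: Vector) -> Matrix:
    """Gradient of ssl_loss with respect to W: 2 * err_i * z_in_j."""
    z_in = [m * zi for m, zi in zip(mask, z)]
    err = [r - zi for r, zi in zip(matvec(W, z_in), z)]
    return [[2.0 * e * x for x in z_in] for e in err]


def adapt_over_trajectory(
    features: List[Vector], head: Vector, mask: Vector, lr: float = 0.1
) -> Tuple[List[float], Matrix]:
    """Take one gradient step per timestep, then read out a value estimate.

    Because W is carried across timesteps, the value at step t depends on
    all frames seen so far, not only on the current frame.
    """
    d = len(head)
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    values = []
    for z in features:  # z: frozen-VLM feature of the current frame (assumed given)
        G = ssl_grad(W, z, mask)
        W = [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, G)]
        adapted = matvec(W, z)
        values.append(sum(h * a for h, a in zip(head, adapted)))
    return values, W


# Toy trajectory of three 3-dimensional "frame features" (made-up numbers).
trajectory = [[1.0, 0.5, -0.5], [0.8, 0.6, -0.2], [0.5, 0.9, 0.1]]
values, W_final = adapt_over_trajectory(
    trajectory, head=[0.2, 0.7, 0.1], mask=[1.0, 0.0, 1.0]
)
print(values)
```

In this sketch the only state passed between timesteps is `W`, which is how history ends up in the adapter's parameters. In the method described in the abstract, the self-supervised loss is itself meta-learned so that each update improves value estimation; here it is fixed for simplicity.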
