VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

Christos Ziakas, Alessandra Russo

TL;DR

VITA introduces a zero-shot value function estimator that leverages test-time adaptation of a frozen vision-language model to encode semantic and temporal context from trajectories. By updating a lightweight adapter at each timestep with a meta-learned self-supervised loss, VITA captures history without task-specific demonstrations and mitigates shortcut learning via dissimilarity-based sampling. The method achieves robust generalization across distribution shifts, distinguishes expert from non-expert trajectories, and provides effective zero-shot reward shaping for offline RL, outperforming autoregressive VLM baselines. These findings demonstrate practical benefits for real-world robotic manipulation and multi-task offline RL, illustrating a scalable approach to temporal- and task-generalizable value estimation.

Abstract

Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards.
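The sequential test-time update described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the adapter is a single linear map `W` over frozen features, the self-supervised objective is a hand-written masked-reconstruction loss standing in for the meta-learned loss, and the names `self_supervised_loss_and_grad` and `adapt_over_trajectory` are hypothetical. The sketch only shows how one gradient step per timestep lets the adapter's parameters accumulate trajectory history.

```python
import numpy as np


def self_supervised_loss_and_grad(W, x, mask):
    """Stand-in self-supervised loss (assumption): reconstruct the full
    feature x from a masked view of it. Returns the loss and dLoss/dW."""
    x_in = x * mask                      # corrupted view of the frozen feature
    residual = W @ x_in - x              # reconstruction error
    loss = 0.5 * float(residual @ residual)
    grad = np.outer(residual, x_in)      # gradient of 0.5 * ||W x_in - x||^2 w.r.t. W
    return loss, grad


def adapt_over_trajectory(features, W0, lr=0.1, keep_prob=0.7, seed=0):
    """Apply one gradient step per timestep, in trajectory order.

    features: array of shape (T, d), frozen per-frame features (assumed given).
    W0:       array of shape (d, d), initial adapter weights.
    Returns the final adapter, the adapted features, and the per-step losses.
    """
    rng = np.random.default_rng(seed)
    W = W0.copy()
    adapted, losses = [], []
    for x in features:
        mask = (rng.random(x.shape) < keep_prob).astype(x.dtype)
        loss, grad = self_supervised_loss_and_grad(W, x, mask)
        W = W - lr * grad                # test-time update; W now carries history
        adapted.append(W @ x)            # representation a value head would read
        losses.append(loss)
    return W, np.stack(adapted), losses


# Toy usage with random unit-norm features in place of real VLM embeddings.
rng = np.random.default_rng(1)
T, d = 5, 8
features = rng.normal(size=(T, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)
W_final, adapted, losses = adapt_over_trajectory(features, np.eye(d))
```

Because `W` is carried forward from one timestep to the next, the adapted feature at step t depends on every earlier frame, which is the sense in which history is encoded in the parameters. In this sketch the value head and the meta-learning of the loss are left out entirely.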

Paper Structure

This paper contains 41 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of VITA. VITA learns a goal-conditioned value function via meta-learning and achieves zero-shot generalization to out-of-distribution trajectories via test-time adaptation.
  • Figure 2: Test-time adaptation. At inference, at each timestep $t$, an adaptation module $f_{\text{adapt}}$ is updated via a gradient step on a meta-learned self-supervised loss $\ell_{\text{self}}$, encoding temporal history.
  • Figure 3: Examples of visual trajectories paired with task descriptions under different distribution shifts. (a) In-distribution. (b, c) Environment shift. (d) Embodiment and environment shift.
  • Figure 4: Each subfigure shows the start and end frames from an expert demonstration used for training, along with its natural language task description. Demonstrations are collected across four distinct ToyKitchen environments.
  • Figure 5: Each subfigure shows the start and end frames from an evaluation trajectory under embodiment shift, along with its natural language task description. The top row depicts tasks in the same environment (ToyKitchen) using a different robot (DeepThought), while the bottom row includes tasks that also involve new environments.
  • ...and 2 more figures