Vision-Language Models as a Source of Rewards

Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Dmitry Nikulin, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang

TL;DR

This work demonstrates that off-the-shelf vision-language models, notably CLIP, can provide effective rewards for training language-conditioned RL agents in visual environments without domain-specific fine-tuning. It analyzes how reward quality scales with VLM size and how prompt design impacts agent performance, showing that larger models yield more accurate rewards and better ground-truth outcomes. The approach is validated in two visually rich domains (Playhouse and AndroidEnv) and reveals a strong correlation between maximizing the VLM-derived rewards and achieving ground-truth goals, highlighting practical potential for generalist RL. The findings suggest that advancing VLM capabilities could enable scalable, reward-driven learning for diverse, language-guided tasks.

Abstract

Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.

Paper Structure (19 sections, 2 equations, 5 figures)

Figures (5)

  • Figure 1: Architecture for using Vision-Language Models (VLMs) as rewards. The contrastively trained VLM contains an image encoder $f_\theta$ and a language encoder $g_\theta$. We embed the current environment observation frame(s) with the image encoder, and embed the desired goal language description $l$ and the negative language descriptions with the language encoder. The reward is computed by taking the cosine similarity between the image embedding and each language embedding, then applying softmax and thresholding.
  • Figure 2: Environments and example tasks. (Left) Playhouse [playhouse] consists of Find, Lift, and Pick and Place tasks. (Right) AndroidEnv [toyama2021androidenv] consists of opening app tasks across various apps on Android.
  • Figure 3: Performance of an agent over the course of online reinforcement learning training when optimizing against the learned VLM reward. We measure both (1) the learned VLM reward return during training and (2) the ground truth reward on held-out evaluation tasks. There is a strong correlation between optimizing the learned VLM reward and the ground truth reward.
  • Figure 4: Scaling reward model size. (Left) Precision-Recall curves for varying VLM architectures and sizes on a fixed offline dataset of Playhouse trajectories. (Right) Ground truth returns on held-out evaluation tasks for Playhouse over the course of training with VLM reward models of varying size.
  • Figure 5: (Left) Prompt engineering effects on the AndroidEnv Open App task. More descriptive and specific prompts perform better when used as rewards. (Right) Prompt templates for the AndroidEnv Open App tasks.
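The reward computation described in the Figure 1 caption (cosine similarity, softmax, thresholding) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function name `vlm_reward`, the `temperature` and `threshold` defaults, and the toy embeddings are all assumptions made for this example. In practice the embeddings would come from the VLM's image encoder $f_\theta$ and language encoder $g_\theta$.

```python
import numpy as np


def vlm_reward(image_emb, goal_emb, negative_embs, temperature=0.01, threshold=0.5):
    """Binary reward from cosine similarity, softmax, and thresholding.

    All argument names and default values are hypothetical; they are not
    taken from the paper.
    """
    def normalize(x):
        # Scale vectors to unit length so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    image_emb = normalize(np.asarray(image_emb, dtype=float))
    # Stack the goal description first, followed by the negative descriptions.
    text_embs = normalize(np.vstack([goal_emb, negative_embs]).astype(float))

    sims = text_embs @ image_emb           # cosine similarity per description
    logits = sims / temperature
    logits = logits - logits.max()         # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()

    goal_prob = float(probs[0])            # softmax probability of the goal
    reward = 1.0 if goal_prob > threshold else 0.0
    return reward, goal_prob


# Toy 2-D embeddings standing in for real encoder outputs (assumed values).
observation = np.array([1.0, 0.0])
goal = np.array([1.0, 0.0])
negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])
reward, goal_prob = vlm_reward(observation, goal, negatives)
print(reward, goal_prob)
```

The negative descriptions matter because the softmax is taken over the goal and the negatives together: the goal's probability is only high when the observation matches the goal better than it matches the alternatives. Thresholding that probability turns a noisy similarity score into a sparse binary signal.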