Code as Reward: Empowering Reinforcement Learning with VLMs

David Venuto; Sami Nur Islam; Martin Klissarov; Doina Precup; Sherry Yang; Ankit Anand

TL;DR

The paper tackles the challenge of reward specification in reinforcement learning by using pre-trained Vision-Language Models (VLMs) to synthesize executable sub-task and reward programs. A three-stage pipeline (generate, verify, integrate) yields dense, interpretable rewards, with the generated programs verified against expert and random trajectories. Empirical results across MiniGrid, Pandas-Gym, and CLIPort show that VLM-CaR rewards can outperform sparse environment rewards, accelerating learning and enabling complex tasks to be learned from image-based observations. The work points to a promising direction for scalable, verifiable reward design, with potential extensions toward end-to-end automation and hierarchical RL frameworks.
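The verify stage can be illustrated with a minimal sketch. It assumes a generated sub-task program is a Python predicate over an observation, and it accepts the program only if it fires in every expert trajectory and rarely in random ones. The grid encoding, the `key_picked_up` check, and both thresholds are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image: a small grid of integer cell codes.
Observation = List[List[int]]
Trajectory = Sequence[Observation]
SubtaskCheck = Callable[[Observation], bool]

KEY = 5  # assumed cell code for a key lying on the floor


def key_picked_up(obs: Observation) -> bool:
    """Toy 'generated' check: the key is no longer visible on the grid."""
    return all(cell != KEY for row in obs for cell in row)


def completion_rate(check: SubtaskCheck, trajectories: Sequence[Trajectory]) -> float:
    """Fraction of trajectories in which the check passes on at least one frame."""
    if not trajectories:
        return 0.0
    hits = sum(any(check(obs) for obs in traj) for traj in trajectories)
    return hits / len(trajectories)


def verify(check, expert, random_rollouts, min_expert=1.0, max_random=0.2) -> bool:
    """Accept a check only if experts always trigger it and random rollouts rarely do.

    Both thresholds are illustrative assumptions.
    """
    return (completion_rate(check, expert) >= min_expert
            and completion_rate(check, random_rollouts) <= max_random)


# Two expert trajectories that pick up the key; five random ones that never do.
EXPERT = [[[[0, KEY]], [[0, 0]]], [[[KEY, 0]], [[0, 0]]]]
RANDOM = [[[[0, KEY]], [[0, KEY]]] for _ in range(5)]

print(verify(key_picked_up, EXPERT, RANDOM))     # check that separates the two sets
print(verify(lambda obs: True, EXPERT, RANDOM))  # trivially true check
```

In this toy setup the first check is accepted and the trivially true check is rejected, because it also passes on every random trajectory.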

Abstract

Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
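To make the cost argument concrete, the sketch below shows one way generated sub-task programs could be turned into a dense reward that runs at every environment step without calling the VLM. It is an illustration under stated assumptions: the `SubtaskReward` class, the ordered one-bonus-per-sub-task scheme, the `cell_absent` helper, and the grid encoding are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image: a small grid of integer cell codes.
Observation = Sequence[Sequence[int]]
SubtaskCheck = Callable[[Observation], bool]

KEY, DOOR = 5, 4  # assumed cell codes for a key and a closed door


def cell_absent(code: int) -> SubtaskCheck:
    """Build a toy check that passes once no cell with this code is visible."""
    def check(obs: Observation) -> bool:
        return all(cell != code for row in obs for cell in row)
    return check


class SubtaskReward:
    """Pays a bonus the first time each sub-task check passes, in order."""

    def __init__(self, checks: List[SubtaskCheck], bonus: float = 1.0):
        self.checks = checks
        self.bonus = bonus
        self.next_index = 0  # index of the first sub-task not yet completed

    def reset(self) -> None:
        """Call at the start of every episode."""
        self.next_index = 0

    def __call__(self, obs: Observation) -> float:
        reward = 0.0
        # Advance through every pending sub-task that the current frame satisfies.
        while self.next_index < len(self.checks) and self.checks[self.next_index](obs):
            reward += self.bonus
            self.next_index += 1
        return reward


# Sub-tasks: first the key disappears (picked up), then the closed door does (opened).
reward_fn = SubtaskReward([cell_absent(KEY), cell_absent(DOOR)])
frames = [[[KEY, DOOR]], [[0, DOOR]], [[0, DOOR]], [[0, 0]]]
rewards = [reward_fn(obs) for obs in frames]
print(rewards)
```

Each call is a plain Python function evaluation, so the per-step cost is independent of the VLM. In this sketch the VLM would only be involved when the checks are written.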

Paper Structure (26 sections, 4 figures, 3 tables)

Introduction
Related Work
Vision-Language Models
VLMs as Reward Models
Code as Policies and Rewards
Preliminaries
Proposed approach
Generating Rewards and Sub-tasks
Verification using Expert and Random trajectories
Using Generated Programs in the RL loop
Experiments
Experimental Procedure
Manual Steps
MiniGrid
Pandas-Gym
...and 11 more sections

Figures (4)

Figure 1: Complete pipeline of VLM-CaR, describing how code blocks for sub-tasks and rewards are generated. The top portion is the reward script generation pipeline, which uses the VLM, and the bottom portion is the RL training loop. The feedback loop is shown on the right and is used to determine if the task and goal code blocks are correct. The middle portion in green represents the generated scripts from the VLM. The task completion scripts are applied to random and expert trajectories to compute if the task was completed or not. All tasks should be completed in expert trajectories and rarely completed in random trajectories.
Figure 2: The online episodic mean reward, evaluated over 5 episodes every 250 steps, for MiniGrid RL tasks. We show the average over $3$ random seeds. $1M$ environment step interactions are used. The shaded area shows the standard error. Agents trained using rewards generated by VLM-CaR perform better than agents trained on the sparse environment reward. In some tasks, sparse rewards are not sufficient for any meaningful performance, whereas VLM-CaR rewards allow the agent to solve the task.
Figure 3: The success rate in completing the final task in Pandas-Gym environments. $5$ random seeds are shown. The shaded area is the standard deviation. An RL agent trained on the dense reward generated by VLM-CaR generally performs better than an RL agent trained on sparse environment rewards.
Figure 4: The inferred reward using VLM-CaR for 4 policies (random, novice, sub-optimal, expert) on different CLIPort environments. CLIPort is trained using imitation learning, so we evaluate our reward function on 4 policies of varying skill levels. An action is taken randomly 30% of the time in the sub-optimal policy and 50% of the time in the novice policy. The reward inferred by VLM-CaR reflects the training process and the performance of the different policies well.
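The four skill levels in Figure 4 can be pictured as one expert policy mixed with random actions at different rates. The sketch below is a simplified illustration: the discrete action list, the `toy_expert` function, and the 0% and 100% rates for the expert and random policies are assumptions, and only the 30% and 50% rates come from the caption.

```python
import random
from typing import Any, Callable, Sequence


def make_noisy_policy(
    expert: Callable[[Any], Any],
    actions: Sequence[Any],
    random_prob: float,
    rng: random.Random,
) -> Callable[[Any], Any]:
    """Return a policy that acts randomly with probability random_prob."""
    def act(obs: Any) -> Any:
        if rng.random() < random_prob:
            return rng.choice(actions)  # uniformly random action
        return expert(obs)              # otherwise follow the expert
    return act


# Hypothetical discrete action set; real CLIPort actions are pick-and-place poses.
ACTIONS = ["left", "right", "pick", "place"]


def toy_expert(obs: Any) -> str:
    """Placeholder expert that always chooses the same action."""
    return "pick"


rng = random.Random(0)
policies = {
    "expert": make_noisy_policy(toy_expert, ACTIONS, 0.0, rng),       # assumed rate
    "sub-optimal": make_noisy_policy(toy_expert, ACTIONS, 0.3, rng),  # from the caption
    "novice": make_noisy_policy(toy_expert, ACTIONS, 0.5, rng),       # from the caption
    "random": make_noisy_policy(toy_expert, ACTIONS, 1.0, rng),       # assumed rate
}
```

Rolling out each policy and scoring the resulting frames with the generated reward program would then give one curve per skill level, which is the comparison the caption describes.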
