Language Reward Modulation for Pretraining Reinforcement Learning

Ademi Adeniji; Amber Xie; Carmelo Sferrazza; Younggyo Seo; Stephen James; Pieter Abbeel

TL;DR

The paper addresses the reward design bottleneck in reinforcement learning by proposing LAMP, a pretraining framework that uses a frozen vision-language model to generate language-conditioned intrinsic rewards for RL pretraining in a diverse, visually rich environment. By combining R3M-based rewards with Plan2Explore, LAMP biases exploration toward semantically meaningful affordances and yields a language-conditioned policy that can be finetuned efficiently on downstream tasks. Across RLBench tasks, LAMP improves sample efficiency and shows robustness to different language prompts and VLMs, highlighting the potential of VLMs as scalable pretraining signals for robotics. Limitations include the reliance on VLM inference speed and the need to extend to longer-horizon sequences; future work may explore faster VLMs and broader language-conditioned pretraining horizons.

Abstract

Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards with reinforcement learning, in conjunction with standard novelty-seeking exploration rewards, to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.
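The reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity below is only a stand-in for the frozen VLM's alignment score, and the function names, the mixing weight `alpha`, and the convex-combination form are assumptions made for illustration.

```python
import math
import random


def cosine_alignment(image_emb, text_emb):
    """Stand-in for a frozen VLM's image-language alignment score (assumption)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in text_emb)
    )
    # Guard against zero vectors so the score is always defined.
    return dot / norm if norm > 0.0 else 0.0


def pretraining_reward(image_emb, text_emb, novelty_bonus, alpha=0.5):
    """Mix a novelty-seeking bonus with the language-alignment reward.

    `alpha` is a hypothetical mixing weight: alpha=1 uses only novelty,
    alpha=0 uses only the VLM-based reward.
    """
    vlm_reward = cosine_alignment(image_emb, text_emb)
    return alpha * novelty_bonus + (1.0 - alpha) * vlm_reward


def sample_instruction(instructions, rng):
    """Pick one instruction per episode from a diverse prompt collection."""
    return rng.choice(instructions)


if __name__ == "__main__":
    rng = random.Random(0)
    prompts = ["pick up the cup", "open the drawer", "push the button"]
    print(sample_instruction(prompts, rng))
    print(pretraining_reward([1.0, 0.0], [1.0, 0.0], novelty_bonus=0.2))
```

In this sketch the policy would be conditioned on the embedding of the sampled instruction during pretraining, then conditioned on the downstream task's instruction and trained on the task reward at finetuning time.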

Paper Structure (43 sections, 4 equations, 13 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Pretraining for RL
Inverse RL from human video
VLMs as task rewards
Background
Reinforcement learning
Reinforcement learning with vision-language reward
R3M
Method
Language Reward Modulation
R3M score as a reward
Rewards with diverse language prompts
Language-Conditioned Behavior Learning
Pretraining environment
...and 28 more sections

Figures (13)

Figure 1: LAMP Framework. Given a diverse set of tasks generated by hand or by a LLM, we extract VLM rewards for language-conditioned RL pretraining. At finetuning time, we condition the agent on the new task language embedding and train on the task reward.
Figure 2: LAMP Method. We use R3M for our VLM-based rewards. We query the R3M score predictor, which is pretrained on the Ego4D dataset, for pixel and language alignment. The reward model is frozen.
Figure 3: Video-language alignment scores from R3M, InternVideo, and ZeST on RLBench downstream tasks plotted over an expert episode with 3 snapshots visualized. Rewards are highly noisy and do not increase smoothly throughout the episode. Optimizing this signal with RL is unlikely to lead to stable solutions, and thus we instead use rewards as an exploration signal during pretraining.
Figure 4: We pretrain on domain-randomized environments based on Ego4D textures, occasionally sampling the default, non-randomized RLBench environment.
Figure 5: Finetuning performance on visual robotic manipulation tasks in RLBench. We provide the performance on additional tasks in the supplementary material. The solid line and shaded region represent mean and standard deviation across 3 seeds.
...and 8 more figures
